2026/01 · 6 min read

Building a Code Evolution Analyzer in a Weekend

What started as a 'quick script to visualize git history' became microservices, NATS message queues, and audio sonification. The anatomy of a one-day project that wasn't.




I wanted to see how a codebase evolved over time. Not just “here’s the current line count” but the actual journey: when did JavaScript take over from Python? When did that test suite explosion happen? What commit added all that generated code?

The plan was simple: run cloc on each commit, collect the numbers, make a chart. Maybe an afternoon of work.

It was not an afternoon of work.

Day One: The CLI Tool

The first version was genuinely simple:

// analyze.mjs (v0.1, the naive version)
import { execSync } from 'child_process';

const commits = execSync('git log --format="%H" --reverse')
  .toString().trim().split('\n');

for (const commit of commits) {
  execSync(`git checkout ${commit}`);
  const result = execSync('cloc . --json');
  // ... collect data
}

This worked! On small repos. On anything with more than a few hundred commits, it was… slow. Like, “go make coffee and come back in twenty minutes” slow.

The problem was cloc. It’s a fantastic tool, but it’s Perl, and it walks the entire file tree every time. For a repo with 10,000 files and 2,000 commits, that’s 20 million file reads.

(I should have done the math before writing the loop. I did not do the math.)

Discovering scc

A quick search for “fast cloc alternative” led me to scc — Sloc Cloc and Code, written in Go. The benchmark claims were wild: 10-100x faster than cloc.

I swapped cloc for scc:

const result = execSync('scc . --format json');

The 20-minute analysis became 15 seconds. Not a typo: 80x faster. scc is compiled Go and counts files in parallel across every available core. For historical analysis, where you’re running the same counter thousands of times, the difference is staggering.
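
For reference, the collection loop after the swap looked roughly like this (a sketch, not the actual tool; the Name and Code field names in scc’s JSON output are assumptions worth checking against your scc version):

// collect per-commit language totals with scc (sketch)
import { execSync } from 'child_process';

const commits = execSync('git log --format="%H %ct" --reverse')
  .toString().trim().split('\n')
  .map((line) => {
    const [hash, timestamp] = line.split(' ');
    return { hash, timestamp: Number(timestamp) };
  });

const history = [];
for (const { hash, timestamp } of commits) {
  execSync(`git checkout --quiet ${hash}`);
  const languages = JSON.parse(execSync('scc . --format json').toString());
  history.push({
    hash,
    timestamp,
    // assumed fields: Name (language) and Code (lines of code)
    byLanguage: Object.fromEntries(languages.map((l) => [l.Name, l.Code])),
  });
}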

Day One, Part Two: The Visualization

With data collection fast enough to be usable, I needed to visualize it. The obvious choice was Chart.js — well-documented, widely used, plenty of examples.

I embedded the entire visualization into the CLI tool as a template literal. One file, one output. No build step, no dependencies for the generated HTML.

const html = `<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
  <style>
    /* 200 lines of CSS */
  </style>
</head>
<body>
  <canvas id="chart"></canvas>
  <script>
    const data = ${JSON.stringify(analysisData)};
    // 400 lines of Chart.js setup
  </script>
</body>
</html>`;

This worked! The visualization showed language breakdown over time, with an animated playback that stepped through commits. Very satisfying to watch a codebase grow.

But Chart.js has a problem: it redraws the entire chart on every update. At 30fps playback with 15 languages and 2,000 data points, that’s a lot of canvas operations. The animation stuttered. Badly.
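
To make the problem concrete, the playback loop boiled down to something like this (simplified, and the data shape is illustrative): every frame appends the next commit’s values and calls chart.update(), which recalculates layout and redraws every dataset on the canvas.

// Simplified playback loop: one commit per animation frame
let frame = 0;

function tick() {
  const snapshot = analysisData[frame]; // one commit's per-language line counts
  chart.data.labels.push(snapshot.hash.slice(0, 7));
  for (const dataset of chart.data.datasets) {
    dataset.data.push(snapshot.byLanguage[dataset.label] ?? 0);
  }
  chart.update('none'); // 'none' skips the tween, but it's still a full redraw
  if (++frame < analysisData.length) requestAnimationFrame(tick);
}

requestAnimationFrame(tick);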

(We’ll come back to this problem later. Spoiler: I ended up writing my own chart renderer.)

Day Two: The Web Service

I wanted to share this tool. Not as a CLI that requires Node.js and Git installed, but as a web service: paste a GitHub URL, get a visualization.

“Just wrap the CLI in an Express server,” I thought. “Maybe add a queue for long-running jobs.”

Reader, I built microservices.

The Architecture

What started as “Express + the CLI” became:

Browser (Lit + Tailwind)
    │ HTTP/WebSocket

API Service (Express 5) ──NATS JetStream──▶ Worker(s)
    │                    (jobs.analyze)        │
    │                                          │
    ├───────────────PostgreSQL─────────────────┤
    │                                          │
    └───────────────Redis (Dragonfly)──────────┘
                  (caching)

Why so complicated? Because I kept running into problems that each component solved:

PostgreSQL: I needed to track which repositories had already been analyzed. SQLite would have worked, but I wanted proper job history and querying.

NATS JetStream: Long-running analysis jobs (some repos take 5+ minutes) needed to survive server restarts. JetStream provides durable queues with at-least-once delivery. (There’s a small sketch of the publish side just after this list.)

Workers: One analysis job maxes out a CPU core. To handle concurrent requests, I needed separate worker processes that could scale independently.

Redis/Dragonfly: Rate limiting, caching site statistics, deduplicating in-progress jobs. Small things that add up.

(Did I need all of this for a side project? No. Did I learn a lot building it? Yes. That’s the real answer.)
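
To make the JetStream piece concrete, here’s roughly what enqueuing a job looks like with the nats.js client. The stream and subject names (JOBS, jobs.analyze) and the payload shape are illustrative guesses, not the service’s exact setup:

// enqueue an analysis job (sketch): JetStream keeps it until a worker acks it
import { connect, StringCodec } from 'nats';

const nc = await connect({ servers: 'nats://localhost:4222' });
const sc = StringCodec();

// One-time setup: a durable stream that captures the job subjects
const jsm = await nc.jetstreamManager();
await jsm.streams.add({ name: 'JOBS', subjects: ['jobs.*'] });

// Publish the job; it stays in the stream across API and worker restarts
const js = nc.jetstream();
await js.publish('jobs.analyze', sc.encode(JSON.stringify({
  jobId: 'abc123',
  repoUrl: 'https://github.com/someuser/somerepo',
})));

await nc.drain();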

The WebSocket Progress Problem

Analysis jobs take 30 seconds to 10 minutes depending on repository size. Users need progress feedback, not just “processing…”

The first attempt used Server-Sent Events (SSE). This worked until I deployed behind Cloudflare, which aggressively buffers SSE streams. Progress updates would arrive in batches of 10 instead of one at a time.

WebSockets don’t have this problem. The browser opens a persistent connection, the server sends progress updates immediately, everyone’s happy.

// services/api/lib/websocket.js
wss.on('connection', (ws) => {
  ws.on('message', (data) => {
    const { type, jobId } = JSON.parse(data);
    if (type === 'subscribe') {
      subscriptions.set(ws, jobId);
    }
  });
  // Clean up on disconnect so the subscriptions Map doesn't leak sockets
  ws.on('close', () => subscriptions.delete(ws));
});

// When progress arrives from NATS
nats.subscribe('progress.*', (msg) => {
  const { jobId, stage, percent } = JSON.parse(msg.data);
  for (const [ws, subscribedId] of subscriptions) {
    if (subscribedId === jobId) {
      ws.send(JSON.stringify({ stage, percent }));
    }
  }
});

The worker publishes progress to NATS, the API service subscribes and forwards to the right WebSocket connections. Clean separation, no direct worker-to-browser communication needed.
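
The worker half of that is just a plain (core) NATS publish per stage. A sketch, with illustrative subject and payload shapes:

// worker: report progress as each stage completes (sketch, not the actual worker)
import { connect, StringCodec } from 'nats';

const nc = await connect({ servers: 'nats://localhost:4222' });
const sc = StringCodec();

function reportProgress(jobId, stage, percent) {
  // Core NATS publish: progress is fire-and-forget, so it doesn't need
  // JetStream's persistence the way the job queue does
  nc.publish(`progress.${jobId}`, sc.encode(JSON.stringify({ jobId, stage, percent })));
}

reportProgress('abc123', 'cloning', 5);
reportProgress('abc123', 'counting commits', 40);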

Day Three: Security Hardening

When you accept arbitrary Git URLs from the internet, you quickly discover all the ways people try to break things.

Path Traversal

The first serious bug was a path traversal vulnerability. The API stored analysis results at /results/{job_id}/visualization.html and served them directly. But what if someone submitted a job and then requested /api/jobs/../../../etc/passwd/visualization?

// BEFORE: vulnerable
app.get('/api/jobs/:id/visualization', (req, res) => {
  res.sendFile(`/results/${req.params.id}/visualization.html`);
});

// AFTER: safe
app.get('/api/jobs/:id/visualization', (req, res) => {
  const resultsDir = path.resolve('/results');
  const filePath = path.resolve(resultsDir, req.params.id, 'visualization.html');
  
  // Compare against resultsDir plus a separator so a sibling like /results-evil can't match
  if (!filePath.startsWith(resultsDir + path.sep)) {
    logger.warn({ path: req.params.id }, 'Path traversal attempt');
    return res.status(400).json({ error: 'Invalid job ID' });
  }
  
  res.sendFile(filePath);
});

Always normalize and validate paths before file access. Log the attempts so you know someone’s probing.

Git Command Injection

Similar issue with repository URLs. The worker runs git clone <url>, which means shell injection if you’re not careful:

// BEFORE: vulnerable
execSync(`git clone ${url} ${workdir}`);

// AFTER: safe
// execFileSync (also from 'child_process') takes args as an array: no shell involved
execFileSync('git', ['clone', '--', sanitizedUrl, workdir], {
  timeout: 300000,
  env: {
    ...process.env,
    GIT_TERMINAL_PROMPT: '0', // Prevent credential prompts
  },
});

Passing the arguments as an array means no shell ever parses the URL, and the -- tells git “everything after this is a positional argument, not a flag.” Combined with URL validation (only allow known Git providers), this prevents most injection attacks.

Content Security Policy

The generated visualizations include JavaScript. If someone found a way to inject content into the analysis output, they could run arbitrary code in viewers’ browsers.

CSP headers prevent this:

res.setHeader('Content-Security-Policy', 
  "default-src 'self'; " +
  "script-src 'self' 'unsafe-inline' cdn.jsdelivr.net; " +
  "style-src 'self' 'unsafe-inline' fonts.googleapis.com; " +
  "font-src fonts.gstatic.com"
);

The unsafe-inline for scripts is unfortunate (the visualization is a single HTML file with embedded JS), but at least external scripts are limited to trusted CDNs.
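
If the visualization were rendered per request instead of written to disk once, a per-response nonce could replace 'unsafe-inline' entirely. A sketch of that alternative (renderVisualization is a hypothetical templating step; the service doesn’t work this way today):

// Hypothetical nonce-based alternative to 'unsafe-inline'
import crypto from 'crypto';

app.get('/api/jobs/:id/visualization', (req, res) => {
  const nonce = crypto.randomBytes(16).toString('base64');
  res.setHeader('Content-Security-Policy',
    "default-src 'self'; " +
    `script-src 'self' 'nonce-${nonce}' cdn.jsdelivr.net; ` +
    "style-src 'self' 'unsafe-inline' fonts.googleapis.com; " +
    "font-src fonts.gstatic.com"
  );
  // The same nonce has to land on the inline <script nonce="..."> tag
  res.send(renderVisualization(req.params.id, nonce));
});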

Day Four: The Audio Experiment

By this point I had a working web service. Time to make it weird.

I added audio sonification: as the visualization plays, each programming language becomes a voice in a dynamic chord. More code = louder voice. The result sounds like a slowly evolving synthesizer pad.

(This is documented in detail in the Audio Sonification Deep Dive.)

The short version: Web Audio API, sawtooth oscillators with low-pass filtering, and some music theory about harmonic series. It’s completely unnecessary and I love it.
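
A minimal one-voice version of the idea, just to make it concrete (illustrative, not the code from the deep dive): a sawtooth oscillator through a low-pass filter into a gain node whose level tracks a language’s share of the codebase.

// One voice: sawtooth → low-pass filter → gain → speakers
const ctx = new AudioContext(); // needs to be created/resumed after a user gesture

const osc = ctx.createOscillator();
osc.type = 'sawtooth';
osc.frequency.value = 220; // each language would get its own pitch

const filter = ctx.createBiquadFilter();
filter.type = 'lowpass';
filter.frequency.value = 1200; // tame the sawtooth's upper harmonics

const gain = ctx.createGain();
gain.gain.value = 0; // start silent

osc.connect(filter).connect(gain).connect(ctx.destination);
osc.start();

// Fade the voice with a language's share of the codebase (0..1)
function setVoiceLevel(share) {
  gain.gain.linearRampToValueAtTime(share * 0.2, ctx.currentTime + 0.1);
}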

What I Actually Learned

Beyond the specific technologies, building this taught me a few things:

Scope creep is fine if you’re learning. This project exists to teach me things. Every “unnecessary” component was a learning opportunity.

scc over cloc for batch processing. The 80x speedup isn’t academic — it’s the difference between “usable” and “unusable.”

NATS is remarkably simple. Compared to Kafka or RabbitMQ, NATS JetStream has almost no operational complexity. nats-server -js and you’re done.

Path validation is security 101. I should have added it from the start. I didn’t. Lesson learned.

WebSockets beat SSE for real-time. When you control both ends and need immediate delivery, WebSockets have fewer edge cases.

Try It Yourself

The service is live at analyze.devd.ca. Paste any public Git repository URL and watch your code evolve.

The CLI tool is on GitHub: slepp/cloc-history-analyzer. Run it locally on your private repos.

Turn on the sound. Trust me.


See also: Deep Dive: Audio Sonification — the Web Audio API experiment

See also: Deep Dive: Canvas 2D Chart Rendering — replacing Chart.js for performance