2026/01 · 6 min read

Building a Code Evolution Analyzer in a Weekend

What started as a 'quick script to visualize git history' became microservices, NATS message queues, and audio sonification. The anatomy of a one-day project that wasn't.




I wanted to see how a codebase evolved over time. Not just “here’s the current line count” but the actual journey: when did JavaScript take over from Python? When did that test suite explosion happen? What commit added all that generated code?

The plan was simple: run cloc on each commit, collect the numbers, make a chart. Maybe an afternoon of work.

It was not an afternoon of work.

Day One: The CLI Tool

The first version was genuinely simple:

// analyze.mjs (v0.1, the naive version)
import { execSync } from 'child_process';

const commits = execSync('git log --format="%H" --reverse')
  .toString().trim().split('\n');

for (const commit of commits) {
  execSync(`git checkout ${commit}`);
  const result = execSync('cloc . --json');
  // ... collect data
}

This worked! On small repos. On anything with more than a few hundred commits, it was… slow. Like, “go make coffee and come back in twenty minutes” slow.

The problem was cloc. It’s a fantastic tool, but it’s Perl, and it walks the entire file tree every time. For a repo with 10,000 files and 2,000 commits, that’s 20 million file reads.

(I should have done the math before writing the loop. I did not do the math.)

Discovering scc

A quick search for “fast cloc alternative” led me to scc — Sloc Cloc and Code, written in Go. The benchmark claims were wild: 10-100x faster than cloc.

I swapped cloc for scc:

const result = execSync('scc . --format json');

The 20-minute analysis became 15 seconds. Not a typo: 80x faster. scc is compiled Go and counts files in parallel across every available core. For historical analysis, where you’re running the same counter thousands of times, the difference is staggering.
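
For reference, the collection loop after the swap looked roughly like this (a sketch, not the actual tool; the Name and Code field names in scc’s JSON output are assumptions worth checking against your scc version):

// collect per-commit language totals with scc (sketch)
import { execSync } from 'child_process';

const commits = execSync('git log --format="%H %ct" --reverse')
  .toString().trim().split('\n')
  .map((line) => {
    const [hash, timestamp] = line.split(' ');
    return { hash, timestamp: Number(timestamp) };
  });

const history = [];
for (const { hash, timestamp } of commits) {
  execSync(`git checkout --quiet ${hash}`);
  const languages = JSON.parse(execSync('scc . --format json').toString());
  history.push({
    hash,
    timestamp,
    // assumed fields: Name (language) and Code (lines of code)
    byLanguage: Object.fromEntries(languages.map((l) => [l.Name, l.Code])),
  });
}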

Day One, Part Two: The Visualization

With data collection fast enough to be usable, I needed to visualize it. The obvious choice was Chart.js — well-documented, widely used, plenty of examples.

I embedded the entire visualization into the CLI tool as a template literal. One file, one output. No build step, no dependencies for the generated HTML.

const html = `<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
  <style>
    /* 200 lines of CSS */
  </style>
</head>
<body>
  <canvas id="chart"></canvas>
  <script>
    const data = ${JSON.stringify(analysisData)};
    // 400 lines of Chart.js setup
  </script>
</body>
</html>`;

This worked! The visualization showed language breakdown over time, with an animated playback that stepped through commits. Very satisfying to watch a codebase grow.

But Chart.js has a problem: it redraws the entire chart on every update. At 30fps playback with 15 languages and 2,000 data points, that’s a lot of canvas operations. The animation stuttered. Badly.
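
To make the problem concrete, the playback loop boiled down to something like this (simplified, and the data shape is illustrative): every frame appends the next commit’s values and calls chart.update(), which recalculates layout and redraws every dataset on the canvas.

// Simplified playback loop: one commit per animation frame
let frame = 0;

function tick() {
  const snapshot = analysisData[frame]; // one commit's per-language line counts
  chart.data.labels.push(snapshot.hash.slice(0, 7));
  for (const dataset of chart.data.datasets) {
    dataset.data.push(snapshot.byLanguage[dataset.label] ?? 0);
  }
  chart.update('none'); // 'none' skips the tween, but it's still a full redraw
  if (++frame < analysisData.length) requestAnimationFrame(tick);
}

requestAnimationFrame(tick);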

(We’ll come back to this problem later. Spoiler: I ended up writing my own chart renderer.)

Day Two: The Web Service

I wanted to share this tool. Not as a CLI that requires Node.js and Git installed, but as a web service: paste a GitHub URL, get a visualization.

“Just wrap the CLI in an Express server,” I thought. “Maybe add a queue for long-running jobs.”

Reader, I built microservices.

The Architecture

What started as “Express + the CLI” became:

Browser (Lit + Tailwind)
    │ HTTP/WebSocket

API Service (Express 5) ──NATS JetStream──▶ Worker(s)
    │                    (jobs.analyze)        │
    │                                          │
    ├───────────────PostgreSQL─────────────────┤
    │                                          │
    └───────────────Redis (Dragonfly)──────────┘
                  (caching)

Why so complicated? Because I kept running into problems that each component solved:

PostgreSQL: I needed to track which repositories had already been analyzed. SQLite would have worked, but I wanted proper job history and querying.

NATS JetStream: Long-running analysis jobs (some repos take 5+ minutes) needed to survive server restarts. JetStream provides durable queues with at-least-once delivery. (There’s a small sketch of the publish side just after this list.)

Workers: One analysis job maxes out a CPU core. To handle concurrent requests, I needed separate worker processes that could scale independently.

Redis/Dragonfly: Rate limiting, caching site statistics, deduplicating in-progress jobs. Small things that add up.

(Did I need all of this for a side project? No. Did I learn a lot building it? Yes. That’s the real answer.)
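
To make the JetStream piece concrete, here’s roughly what enqueuing a job looks like with the nats.js client. The stream and subject names (JOBS, jobs.analyze) and the payload shape are illustrative guesses, not the service’s exact setup:

// enqueue an analysis job (sketch): JetStream keeps it until a worker acks it
import { connect, StringCodec } from 'nats';

const nc = await connect({ servers: 'nats://localhost:4222' });
const sc = StringCodec();

// One-time setup: a durable stream that captures the job subjects
const jsm = await nc.jetstreamManager();
await jsm.streams.add({ name: 'JOBS', subjects: ['jobs.*'] });

// Publish the job; it stays in the stream across API and worker restarts
const js = nc.jetstream();
await js.publish('jobs.analyze', sc.encode(JSON.stringify({
  jobId: 'abc123',
  repoUrl: 'https://github.com/someuser/somerepo',
})));

await nc.drain();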

The WebSocket Progress Problem

Analysis jobs take 30 seconds to 10 minutes depending on repository size. Users need progress feedback, not just “processing…”

The first attempt used Server-Sent Events (SSE). This worked until I deployed behind Cloudflare, which aggressively buffers SSE streams. Progress updates would arrive in batches of 10 instead of one at a time.

WebSockets don’t have this problem. The browser opens a persistent connection, the server sends progress updates immediately, everyone’s happy.

// services/api/lib/websocket.js
wss.on('connection', (ws) => {
  ws.on('message', (data) => {
    const { type, jobId } = JSON.parse(data);
    if (type === 'subscribe') {
      subscriptions.set(ws, jobId);
    }
  });
  // Clean up on disconnect so the subscriptions Map doesn't leak sockets
  ws.on('close', () => subscriptions.delete(ws));
});

// When progress arrives from NATS
nats.subscribe('progress.*', (msg) => {
  const { jobId, stage, percent } = JSON.parse(msg.data);
  for (const [ws, subscribedId] of subscriptions) {
    if (subscribedId === jobId) {
      ws.send(JSON.stringify({ stage, percent }));
    }
  }
});

The worker publishes progress to NATS, the API service subscribes and forwards to the right WebSocket connections. Clean separation, no direct worker-to-browser communication needed.
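
The worker half of that is just a plain (core) NATS publish per stage. A sketch, with illustrative subject and payload shapes:

// worker: report progress as each stage completes (sketch, not the actual worker)
import { connect, StringCodec } from 'nats';

const nc = await connect({ servers: 'nats://localhost:4222' });
const sc = StringCodec();

function reportProgress(jobId, stage, percent) {
  // Core NATS publish: progress is fire-and-forget, so it doesn't need
  // JetStream's persistence the way the job queue does
  nc.publish(`progress.${jobId}`, sc.encode(JSON.stringify({ jobId, stage, percent })));
}

reportProgress('abc123', 'cloning', 5);
reportProgress('abc123', 'counting commits', 40);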

Day Three: Security Hardening

When you accept arbitrary Git URLs from the internet, you quickly discover all the ways people try to break things.

Path Traversal

The first serious bug was a path traversal vulnerability. The API stored analysis results at /results/{job_id}/visualization.html and served them directly. But what if someone submitted a job and then requested /api/jobs/../../../etc/passwd/visualization?

// BEFORE: vulnerable
app.get('/api/jobs/:id/visualization', (req, res) => {
  res.sendFile(`/results/${req.params.id}/visualization.html`);
});

// AFTER: safe
app.get('/api/jobs/:id/visualization', (req, res) => {
  const resultsDir = path.resolve('/results');
  const filePath = path.resolve(resultsDir, req.params.id, 'visualization.html');
  
  // Compare against resultsDir plus a separator so a sibling like /results-evil can't match
  if (!filePath.startsWith(resultsDir + path.sep)) {
    logger.warn({ path: req.params.id }, 'Path traversal attempt');
    return res.status(400).json({ error: 'Invalid job ID' });
  }
  
  res.sendFile(filePath);
});

Always normalize and validate paths before file access. Log the attempts so you know someone’s probing.

Git Command Injection

Similar issue with repository URLs. The worker runs git clone <url>, which means shell injection if you’re not careful:

// BEFORE: vulnerable
execSync(`git clone ${url} ${workdir}`);

// AFTER: safe
// execFileSync (also from 'child_process') takes args as an array: no shell involved
execFileSync('git', ['clone', '--', sanitizedUrl, workdir], {
  timeout: 300000,
  env: {
    ...process.env,
    GIT_TERMINAL_PROMPT: '0', // Prevent credential prompts
  },
});

Passing the arguments as an array means no shell ever parses the URL, and the -- tells git “everything after this is a positional argument, not a flag.” Combined with URL validation (only allow known Git providers), this prevents most injection attacks.

Content Security Policy

The generated visualizations include JavaScript. If someone found a way to inject content into the analysis output, they could run arbitrary code in viewers’ browsers.

CSP headers prevent this:

res.setHeader('Content-Security-Policy', 
  "default-src 'self'; " +
  "script-src 'self' 'unsafe-inline' cdn.jsdelivr.net; " +
  "style-src 'self' 'unsafe-inline' fonts.googleapis.com; " +
  "font-src fonts.gstatic.com"
);

The unsafe-inline for scripts is unfortunate (the visualization is a single HTML file with embedded JS), but at least external scripts are limited to trusted CDNs.
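
If the visualization were rendered per request instead of written to disk once, a per-response nonce could replace 'unsafe-inline' entirely. A sketch of that alternative (renderVisualization is a hypothetical templating step; the service doesn’t work this way today):

// Hypothetical nonce-based alternative to 'unsafe-inline'
import crypto from 'crypto';

app.get('/api/jobs/:id/visualization', (req, res) => {
  const nonce = crypto.randomBytes(16).toString('base64');
  res.setHeader('Content-Security-Policy',
    "default-src 'self'; " +
    `script-src 'self' 'nonce-${nonce}' cdn.jsdelivr.net; ` +
    "style-src 'self' 'unsafe-inline' fonts.googleapis.com; " +
    "font-src fonts.gstatic.com"
  );
  // The same nonce has to land on the inline <script nonce="..."> tag
  res.send(renderVisualization(req.params.id, nonce));
});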

Day Four: The Audio Experiment

By this point I had a working web service. Time to make it weird.

I added audio sonification: as the visualization plays, each programming language becomes a voice in a dynamic chord. More code = louder voice. The result sounds like a slowly evolving synthesizer pad.

(This is documented in detail in the Audio Sonification Deep Dive.)

The short version: Web Audio API, sawtooth oscillators with low-pass filtering, and some music theory about harmonic series. It’s completely unnecessary and I love it.
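
A minimal one-voice version of the idea, just to make it concrete (illustrative, not the code from the deep dive): a sawtooth oscillator through a low-pass filter into a gain node whose level tracks a language’s share of the codebase.

// One voice: sawtooth → low-pass filter → gain → speakers
const ctx = new AudioContext(); // needs to be created/resumed after a user gesture

const osc = ctx.createOscillator();
osc.type = 'sawtooth';
osc.frequency.value = 220; // each language would get its own pitch

const filter = ctx.createBiquadFilter();
filter.type = 'lowpass';
filter.frequency.value = 1200; // tame the sawtooth's upper harmonics

const gain = ctx.createGain();
gain.gain.value = 0; // start silent

osc.connect(filter).connect(gain).connect(ctx.destination);
osc.start();

// Fade the voice with a language's share of the codebase (0..1)
function setVoiceLevel(share) {
  gain.gain.linearRampToValueAtTime(share * 0.2, ctx.currentTime + 0.1);
}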

What I Actually Learned

Beyond the specific technologies, building this taught me a few things:

Scope creep is fine if you’re learning. This project exists to teach me things. Every “unnecessary” component was a learning opportunity.

scc over cloc for batch processing. The 80x speedup isn’t academic — it’s the difference between “usable” and “unusable.”

NATS is remarkably simple. Compared to Kafka or RabbitMQ, NATS JetStream has almost no operational complexity. nats-server -js and you’re done.

Path validation is security 101. I should have added it from the start. I didn’t. Lesson learned.

WebSockets beat SSE for real-time. When you control both ends and need immediate delivery, WebSockets have fewer edge cases.

Try It Yourself

The service is live at analyze.devd.ca. Paste any public Git repository URL and watch your code evolve.

The CLI tool is on GitHub: slepp/cloc-history-analyzer. Run it locally on your private repos.

Turn on the sound. Trust me.


See also: Deep Dive: Audio Sonification — the Web Audio API experiment

See also: Deep Dive: Canvas 2D Chart Rendering — replacing Chart.js for performance