Deep Dive: Worker Heartbeats and Job Recovery
A database-first ownership model for long-running jobs, with heartbeats, stale detection, and deterministic recovery.
Three hours into production, a worker crashed mid-analysis. JetStream dutifully redelivered the message. The new worker looked at the database, saw “processing,” and skipped it. The queue did its job. The system lost the work anyway.
That failure taught me where the real boundary lives. JetStream gives me at-least-once delivery, but delivery is not ownership. A worker can pull a message, crash, and leave the database thinking the job is still in flight. If a redelivery happens, the next worker sees an in-progress job that belongs to a dead process—and politely backs off.
The fix was to move ownership into the database and treat the queue purely as transport. The queue delivers messages; the database decides who owns the work. Once I drew that line, the recovery logic stopped feeling like a web of edge cases and started feeling like a small set of conditionals.
The Core Invariant
The smallest behaviour I need to protect: a job is owned by exactly one worker at a time. If that statement ever stops being true, the system leaks work. Everything in this article exists to make that invariant hold across crashes, restarts, and network partitions.
The Ownership Record
Jobs get two extra fields: a worker identity and a last-activity timestamp. That’s the entire heartbeat model.
ALTER TABLE jobs ADD COLUMN worker_id TEXT;
ALTER TABLE jobs ADD COLUMN last_activity_at TIMESTAMP WITH TIME ZONE;
CREATE INDEX idx_jobs_processing_activity
ON jobs(status, last_activity_at)
WHERE status = 'processing';
The index is the constraint: heartbeats must be cheap to query, or the recovery loop becomes the new bottleneck. A partial index on status = 'processing' keeps the scan tight even as the jobs table grows.
Worker Identity
A worker’s identity is hostname-pid-random. That choice is mechanical, not aesthetic.
- Hostname identifies the pod.
- PID distinguishes restarts inside the pod.
- Random bytes avoid collisions when processes restart fast.
// os and crypto are Node's built-in modules.
const WORKER_ID = `${os.hostname()}-${process.pid}-${crypto.randomBytes(4).toString('hex')}`;
This is not globally unique, but it is unique enough to distinguish one process instance from the next, which is what recovery needs. A UUID would work, but this format makes debugging easier—I can see which pod a job belongs to without a lookup.
Claiming a Job Is a Conditional Update
Here is where the design either works or fails. Claiming is a single UPDATE ... WHERE that encodes the ownership rules. If the update touches no rows, you do not own the job.
async function claimJob(jobId) {
const result = await db.query(`
UPDATE jobs
SET worker_id = $1,
last_activity_at = NOW(),
status = 'processing'
WHERE id = $2
AND status IN ('queued', 'processing')
AND (
worker_id IS NULL
OR worker_id = $1
OR last_activity_at < NOW() - INTERVAL '5 minutes'
)
RETURNING id
`, [WORKER_ID, jobId]);
return result.rows.length > 0;
}
The WHERE clause is the contract. A worker can claim a job if:
- It is queued and unowned (worker_id IS NULL)
- The worker already owns it (re-entrancy, useful for retries)
- The previous owner stopped updating for 5 minutes (stale detection)
That single INTERVAL '5 minutes' bounds both recovery time and false positives. Make it shorter, and transient network blips trigger unnecessary re-queues. Make it longer, and real failures take too long to recover. Five minutes is a tuned value, not a guess—long enough for a git clone to stall on a slow network, short enough that users do not wait forever.
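To make the contract concrete, here is a rough sketch of how a message handler might use claimJob. The message decoding, analyzeRepository, and the error path are stand-ins for the real worker loop, not the production code.
async function handleAnalyzeMessage(msg) {
  // Assumes msg.data is a Uint8Array carrying the published JSON and msg.ack() exists;
  // the exact shape depends on how the JetStream consumer is wired up.
  const { job_id, repo_url } = JSON.parse(new TextDecoder().decode(msg.data));

  // If the conditional UPDATE touched no rows, another worker owns this job.
  if (!(await claimJob(job_id))) {
    msg.ack(); // acknowledge so the message is not redelivered to a worker that cannot own it
    return;
  }

  try {
    await analyzeRepository(job_id, repo_url); // the real work, defined elsewhere
    msg.ack();
  } catch (err) {
    // In the real worker this would mark the job failed or hand it back for recovery;
    // the error path is elided in this sketch.
    console.error(`job ${job_id} failed on ${WORKER_ID}:`, err);
  }
}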
Heartbeats Piggyback on Progress
I do not run a separate heartbeat thread. Every progress update is a heartbeat, which means progress reporting is also the liveness signal.
async function updateProgress(jobId, stage, progress, message) {
await db.query(`
UPDATE jobs
SET stage = $1,
progress = $2,
message = $3,
last_activity_at = NOW()
WHERE id = $4
`, [stage, progress, message, jobId]);
}
If a worker is stuck, updates stop. That is the only failure mode I need to detect. There is no separate “I’m alive” message cluttering the system—the work itself is the proof of life.
This design has a constraint: jobs must make visible progress at least once every 5 minutes. For the Code Evolution Analyzer, that is easy—cloning and analyzing are both chatty operations. For a system with legitimately long silent stretches, you would need periodic no-op heartbeats. That trade-off did not apply here.
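If that trade-off ever changed, a no-op heartbeat would be the same last_activity_at update on a timer. A minimal sketch, assuming a hypothetical touchJob helper wrapped around the silent operation:
// Hypothetical helper: refresh only the liveness timestamp, no progress change.
async function touchJob(jobId) {
  await db.query(
    `UPDATE jobs SET last_activity_at = NOW() WHERE id = $1 AND worker_id = $2`,
    [jobId, WORKER_ID]
  );
}

// Wrap a long silent operation so the job stays "alive" while it runs.
async function withHeartbeat(jobId, fn) {
  const timer = setInterval(() => touchJob(jobId).catch(() => {}), 60 * 1000);
  try {
    return await fn();
  } finally {
    clearInterval(timer);
  }
}
Keeping the refresh interval well under the 5-minute threshold leaves room for a missed tick or two before the job looks stale.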
Startup Reconciliation
When a worker starts, it looks for jobs owned by previous incarnations of the same pod and clears them. The match is by hostname prefix, not by full ID, because the PID and random suffix change on restart.
async function reconcileOrphanedJobs() {
const hostPattern = `${os.hostname()}-%`;
const orphanedJobs = await db.query(`
SELECT id, repo_url, status, stage, worker_id
FROM jobs
WHERE worker_id LIKE $1
AND worker_id != $2
AND status IN ('queued', 'processing')
`, [hostPattern, WORKER_ID]);
for (const job of orphanedJobs.rows) {
await clearAndReenqueueJob(job.id, job.repo_url, 'Worker restarted');
}
}
This is not elegant; it is reliable. A worker that comes back will clean up its own mess before taking new work. The alternative—waiting for the 5-minute stale threshold—would leave jobs stranded longer than necessary after a pod restart.
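For completeness, the startup ordering might look like this; startConsuming and startStaleScan are placeholders for the JetStream consumer setup (not shown here) and the stale-scan loop in the next section:
async function startWorker() {
  // 1. Clean up jobs stranded by previous incarnations of this pod.
  await reconcileOrphanedJobs();

  // 2. Begin pulling work from JetStream (consumer setup elided).
  await startConsuming();

  // 3. Join the shared stale-detection loop.
  startStaleScan();

  console.log(`worker ${WORKER_ID} ready`);
}

startWorker().catch((err) => {
  console.error('worker failed to start', err);
  process.exit(1);
});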
Stale Job Detection
Every 30 seconds, any healthy worker scans for stale jobs and re-enqueues them. There is no leader election, no coordinator. The database is the arbiter.
const HEARTBEAT_INTERVAL_MS = 30 * 1000;   // how often each worker scans for stale jobs
const STALE_THRESHOLD_MS = 5 * 60 * 1000;  // mirrors the INTERVAL '5 minutes' in the query below
setInterval(async () => {
const staleJobs = await db.query(`
SELECT id, repo_url, worker_id, last_activity_at
FROM jobs
WHERE status = 'processing'
AND (
last_activity_at IS NULL
OR last_activity_at < NOW() - INTERVAL '5 minutes'
)
ORDER BY last_activity_at ASC NULLS FIRST
LIMIT 100
`);
for (const job of staleJobs.rows) {
await clearAndReenqueueJob(job.id, job.repo_url,
`Worker ${job.worker_id} stopped responding`
);
}
}, HEARTBEAT_INTERVAL_MS);
The LIMIT 100 keeps the scan bounded. If the system is on fire, it should still make forward progress rather than trying to recover everything at once. The ORDER BY prioritises jobs that have been stuck longest.
Because every worker runs this scan, you might expect duplicates—and you would be right. Two workers can re-enqueue the same job. That is fine. The claim step makes it harmless.
Re-enqueue Is a Reset, Not a Resume
Clearing a job sets it back to the queued state and republishes it to NATS. There is no partial resume. The system is honest about that constraint.
async function clearAndReenqueueJob(jobId, repoUrl, reason) {
const job = await db.query('SELECT status FROM jobs WHERE id = $1', [jobId]);
if (job.rows[0]?.status === 'completed' || job.rows[0]?.status === 'failed') return;
await db.query(`
UPDATE jobs SET
status = 'queued',
worker_id = NULL,
last_activity_at = NULL,
started_at = NULL,
message = $1
WHERE id = $2
`, [`Cleared: ${reason}`, jobId]);
await js.publish('jobs.analyze', JSON.stringify({
job_id: jobId,
repo_url: repoUrl,
cleared: true,
clear_reason: reason,
}));
}
The cleared: true flag lets downstream code know this is a retry, which is useful for metrics and debugging. I would love resumable jobs—checkpointing after each commit analysis—but without deterministic checkpoints the safest recovery is to start over. The repo cache makes re-cloning cheap, so the cost of a reset is bounded by the analysis time already spent, not by total job time.
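On the consuming side, reading the flag is one conditional. A hypothetical hook, with metrics standing in for whatever instrumentation client is actually wired up:
// Hypothetical: count retries separately from first deliveries.
function recordDelivery(payload) {
  if (payload.cleared) {
    metrics.increment('jobs.redelivered_after_clear');
    console.log(`retrying job ${payload.job_id}: ${payload.clear_reason}`);
  } else {
    metrics.increment('jobs.first_delivery');
  }
}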
Why Duplicates Are Harmless
Stale detection can race. Two workers can detect the same stale job and both re-enqueue it. JetStream may deliver the resulting message to two different workers. The database claim step makes that harmless because only one UPDATE ... WHERE will succeed.
Worker A: detects job 123 is stale, re-enqueues
Worker B: detects job 123 is stale, re-enqueues (race)
NATS: delivers job 123 to Worker C
NATS: delivers job 123 to Worker D (duplicate)
Worker C: claims job 123, UPDATE returns 1 row, proceeds
Worker D: claims job 123, UPDATE returns 0 rows, skips
The queue can deliver duplicates; the database refuses them. That division of responsibility is the whole design.
Tuning the Timers
The three numbers that matter:
- STALE_THRESHOLD_MS is 5 minutes.
- HEARTBEAT_INTERVAL_MS is 30 seconds (the stale scan interval).
- NATS ack_wait is 10 minutes, set in nanoseconds.
The constraint is ordering. ack_wait must be longer than the stale threshold, otherwise you get a race between queue redelivery and database-driven recovery. If NATS redelivers before the database declares a job stale, you might have two workers both trying to claim—which the claim logic handles—but you also get confusing metrics about whether recovery came from the queue or the database.
Keeping NATS as the backstop (longer timeout) and the database as the primary recovery mechanism (shorter timeout) makes the behaviour predictable.
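Roughly, the ordering can be asserted where the consumer is defined. This is a sketch against the nats.js JetStreamManager API; jsm, the stream name, the durable name, and the AckPolicy import are assumptions, not the project's actual configuration:
const ACK_WAIT_MS = 10 * 60 * 1000;

// The queue backstop must outlast the database's stale threshold.
if (ACK_WAIT_MS <= STALE_THRESHOLD_MS) {
  throw new Error('ack_wait must exceed the stale threshold');
}

// Consumer definition sketch; stream and durable names are illustrative.
await jsm.consumers.add('JOBS', {
  durable_name: 'analyzer-workers',
  ack_policy: AckPolicy.Explicit,
  ack_wait: ACK_WAIT_MS * 1_000_000, // NATS expresses ack_wait in nanoseconds
});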
What This Still Does Not Solve
- Partial network failure: A worker that can reach NATS but not the database will look dead. That is acceptable; partial work should be retried. The alternative—treating queue acknowledgement as ownership—creates worse failure modes.
- Clock skew: Irrelevant because all timestamps come from NOW() in PostgreSQL. Workers do not compare their local clocks.
- Hung processes: A deadlock that keeps the process alive looks identical to a crash. The only safe assumption is: no progress means no worker. If you need to detect hung processes specifically, you would need out-of-band health checks, which this system does not have.
The Model That Holds
The queue delivers, the database owns. That is the mental model I hold during incidents, and it is simple enough to be correct under stress.
What surprised me was how much simpler the recovery logic became once I stopped treating the queue as the source of truth. JetStream is excellent at durable delivery, but durable delivery is not the same as exclusive ownership. The moment I gave ownership to the database, the implementation became a list of conditionals instead of a web of hope.
The system now recovers from worker crashes in under 6 minutes (5-minute threshold plus one scan interval), and the recovery path is the same whether a worker died, a pod restarted, or a network partition healed. One mechanism, predictable behaviour, no coordination.
See also: Building a Code Evolution Analyzer in a Weekend — the full project story