Tortoise
Forge

Green Tests Can Lie

  • #workhorse
  • #vitest
  • #regression-detector
  • #durability

S2 started with five tasks on the board and ended with two shipped. Mid-session I asked a question I hadn’t asked before: how does any of this hold up after a thousand tasks? The answer broke the plan. Features were fine. Infrastructure was the risk. So S2 pivoted from building more surface area to protecting the surface area that already exists.

Two tasks shipped. T2-1 was an Express API scaffold split cleanly into createApp() for tests and startServer() for the actual listener. Every future HTTP route in Workhorse now inherits a CORS policy that whitelists one origin, an error handler that never leaks internals to the browser, a request logger that records metadata only (no bodies, no query values), and a shutdown sequence that drains in-flight requests on SIGTERM with a ten-second force-exit backstop. The scaffold doesn’t do anything useful on its own. That’s the point. The next task can mount /api/journal and only worry about journal semantics. Policy is solved.

T2-2 was where the session earned its name. Vitest plus supertest wired in, 25 tests across four files, covering the invariants that matter: vault writes must append, not overwrite. The anti-injection boundary in the Whisper system prompt must stay intact. The Anthropic kill switch must throw before any SDK construction if ANTHROPIC_ENABLED isn’t set. Error responses must never contain err.message or err.stack. CORS must reject hostile origins. Each of those got a test that asserts today’s correct behavior.

Then I did the thing I should have been doing all along. I broke each invariant at the source and ran the tests. Flipped fs.appendFile to fs.writeFile in the vault writer. The append-semantics test went red. Reverted. Deleted the anti-injection line from the Whisper prompt. The source-inspection test went red. Reverted. Inverted the kill-switch polarity. The kill-switch test went red. Reverted. Each flip took thirty seconds. Each one proved the test was load-bearing, not decorative.

This is the principle I didn’t have language for before the session. A safety check is only a safety check if something you can test confirms it fires. Most “we had a check for that” post-mortems are the second layer failing silently while the first layer looks intact. The guard exists. The assumption the guard depends on (a hard exit, a truthy env var, a correctly-typed import path) has quietly rotted. The test that would have caught the rot was never written, or was written to pass whether the guard held or not.

The railway analog comes pre-built. Track circuits run a self-test before every shift. The signal system deliberately breaks the circuit and confirms the signal goes red. Green-when-broken means the block isn’t actually protected; the railway has been running on luck. A test that passes when the invariant is intact and also passes when the invariant is broken is the same failure mode. It’s a green light you can’t trust.

Three concrete instances from today. Writing vi.mock('../../src/connectors/whisper.js') without the .js silently imports the real module, so a kill-switch test looks correct while hitting the live Anthropic SDK. An inner catch {} in journal.ts swallows the ExitError sentinel, so the idempotency guard only works because process.exit hard-exits the real process; in-test, without hard exit, the guard is defeated. The anti-injection instruction in the system prompt is a literal string in source, so removing it breaks nothing a type-checker can see. Three different layers, three different failure modes, all invisible to the casual reader. The regression detector is the only thing that makes the invariant real.

The template is now set. Every new guard in Workhorse gets paired with a test that flips the source, runs red, reverts. If flipping the source doesn’t trip the test, the test is cosmetic. If the test is cosmetic, delete it or rewrite it. The friction of this is the feature. At a thousand tasks the probability of silent regression approaches one, and the regression detectors are the only thing standing between that probability and a vault with twenty-four years of corrupted journal entries.

T2-3 rolls into S3. The shipped folder needs an index before it crosses the legibility threshold, and the testing pattern from today slots straight in. Build the index, write the regression detector that fails when appends stop firing, flip the source to confirm the detector works, revert.

Auto-commit is now a standing rule. Forge already gates shipping behind three confirmations — prompt audit, code review, ship check. A fourth confirmation at git-commit is redundant friction. Push stays per-event because push is externally visible and irreversible on shared branches. Commit is cheap, push is the decision.

The thing I didn’t expect out of S2 was the strategic question itself. “After a thousand tasks, does this hold?” is the first horizon-scale question I’ve asked about Workhorse mid-build. It cut the plan in half and reshaped what shipped. I want more of those, earlier.