ZIPIX
OPERATIONS CASE STUDY

One observation.
A shipped hotfix.

How a three-sentence bug report became a tested, documented, deployed release. The human was in the loop for one report and two decisions. The studio did the rest.

Game · State Tap · v1.2.2 → v1.2.3 · commit 3cd3302f · 2026-06-06 · live on prod

The amplification

Two human touches in, a whole release out.

The person did the irreplaceable part: noticing something was off, and making two taste calls. Encoded workflows turned that judgment into a correct, shipped fix.

28 bugs surfaced, from the 2 the human spotted
81,375 rows of ground-truth Census data audited
3,643 tests green before anything shipped
~90s of human decisions, start to prod

The entire human contribution

Everything the person actually did.

Touch 1 · The report

"Augusta is also in Georgia (pop >200k), and Charleston is in South Carolina (>100k). Given these two gigantic misses, we need a thorough check, because I doubt the completeness of our list."

Touch 2 · The decision

"Keep capitals strict, so relabel it 'Augusta (Capital City).' Good find on the city adds. Make the change, land it on main, bump the version, get it on prod. It is a game fix, not a full release."

That is the whole input. No file paths, no commands, no test runs, no release steps. The eleven steps below ran on encoded studio knowledge.

The anatomy

From a sentence to a deploy, in six phases.

01 · Diagnose · human seed
Read the data model, find the real root cause.

02 · Audit · auto
Fetch Census ground truth, diff every clue.

03 · Decide · decision gate
Present findings, take two taste calls.

04 · Implement · auto
Worktree, edits, copy pass, version bump.

05 · Verify · auto
Typecheck, 3,643 tests, prepublish gates.

06 · Ship · auto
Lock, land, gated prod publish, smoke check.

Human in the loop (2 points) · Autonomous, skill and tool driven

Phase 1 · Structural diagnosis

It read the architecture, not the symptom.

The shallow patch

  • Add Georgia to "Augusta."
  • Add South Carolina to "Charleston."
  • Call it done. Ship two array edits.
  • The other 26 collisions stay broken.

What actually happened

  • Found two answer regimes: city and capital.
  • The capital regime hardcodes acceptSet:[state].
  • Augusta and Charleston are capitals (ME, WV); their famous twins live elsewhere.
  • Unlabeled prompt + the "≥10k city" promise made it systemic, not a one-off.
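The collision is easier to see in code. The following is a minimal sketch, not the game's actual data model: the `Clue` class, the `accept_set` field, and both helper functions are hypothetical names standing in for the `acceptSet:[state]` behavior described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clue:
    prompt: str            # text shown to the player
    accept_set: frozenset  # state codes counted as correct
    regime: str            # "city" or "capital"


def capital_clue(name, state):
    # Capital regime: exactly one accepted state, mirroring acceptSet:[state].
    return Clue(prompt=name, accept_set=frozenset({state}), regime="capital")


def is_accepted(clue, answer):
    return answer in clue.accept_set


# Augusta is the capital of Maine, so the capital regime accepts only ME.
augusta = capital_clue("Augusta", "ME")
print(is_accepted(augusta, "ME"))  # True
print(is_accepted(augusta, "GA"))  # False: the larger Georgia city is rejected
```

With an unlabeled prompt, a player who answers with the better-known twin is marked wrong, which is why the same pattern repeats for every capital that shares its name with a bigger city.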
Phase 2 · Ground truth, not memory

It refused to trust its own recall.

City populations are exactly where a raw model hallucinates. So instead of guessing, it pulled the authoritative source and computed the answer.

A

Fetched the US Census SUB-EST2023 file

The same source the dataset already cited. 81,375 incorporated-place rows, real populations, zero guesswork.

B

Processed it inside a sandbox (think-in-code)

All 81k rows stayed in the context-mode plugin. Only the 28-row diff entered the model's window. The audit cost almost no context.

C

Handled the messy real world

Normalized consolidated city-counties (Augusta-Richmond, Macon-Bibb) and caught the New England SUMLEV-061 town quirk the original audit had missed.
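Steps A to C can be sketched roughly as follows. This is an illustration under stated assumptions, not the audit that actually ran: the column names are assumed to match the Census file, the sample rows and their populations are placeholders rather than Census values, "Sampleton" is an invented town, and `normalize_name` and `audit` are hypothetical helpers. The point is the shape: all rows are processed in code and only the short diff is returned.

```python
import csv
import io
import re

# Placeholder rows. Populations are illustrative, NOT real Census figures.
SAMPLE = """SUMLEV,NAME,STNAME,POPESTIMATE2023
162,Augusta-Richmond County consolidated government (balance),Georgia,200000
162,Augusta city,Maine,19000
162,Charleston city,South Carolina,150000
162,Charleston city,West Virginia,47000
061,Sampleton town,Massachusetts,18000
162,Tiny village,Ohio,900
"""

NEW_ENGLAND = {"Connecticut", "Maine", "Massachusetts",
               "New Hampshire", "Rhode Island", "Vermont"}


def normalize_name(raw):
    """Reduce a Census place name to the name a player would type."""
    name = re.sub(r"\s*\(balance\)$", "", raw)
    name = re.sub(r"\s+(consolidated|unified) government$", "", name)
    merged = re.match(r"^(.+?)-.+ County$", name)  # "Augusta-Richmond County"
    if merged:
        name = merged.group(1)
    return re.sub(r"\s+(city|town|village|borough)$", "", name)


def audit(census_csv, known, min_pop=10_000):
    """Return only the (name, state, population) rows missing from `known`."""
    missing = []
    for row in csv.DictReader(io.StringIO(census_csv)):
        is_place = row["SUMLEV"] == "162"
        # Assumed handling of the town quirk: New England towns appear as
        # county subdivisions (061) rather than incorporated places.
        is_ne_town = row["SUMLEV"] == "061" and row["STNAME"] in NEW_ENGLAND
        if not (is_place or is_ne_town):
            continue
        pop = int(row["POPESTIMATE2023"])
        if pop < min_pop:
            continue
        key = (normalize_name(row["NAME"]), row["STNAME"])
        if key not in known:
            missing.append((*key, pop))
    return sorted(missing, key=lambda r: -r[2])


known = {("Augusta", "Maine"), ("Charleston", "West Virginia")}
diff = audit(SAMPLE, known)
for name, state, pop in diff:
    print(name, state, pop)
```

Only `diff` would need to reach the model's window; the full file stays inside the sandboxed process.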

The payoff

Two reported bugs became twenty-eight found.

2 spotted by the human (Augusta, Charleston)
28 surfaced by the audit (23 capital + 5 city)

[Chart: human-reported 2 · capital collisions 23 · city-list misses 5]

The human named the two smallest fish. The audit caught the bigger ones: Columbus GA (202k), Springfield MO & MA (170k / 154k), Concord CA (122k).

Phase 3 · The decision gate

It asked only what it could not decide.

The audit produced one genuine judgment call: should a capital that shares a name with a bigger city accept both states, or stay strict? That is a taste question, not a fact. So it stopped and asked.

Presented to the human

A complete findings table, a recommended fix, and the one fork that needed a human: strict capitals with a clarifying label, or multi-state accept.

Decided in two lines

Strict capitals, add the "(Capital City)" label, take the five city adds, ship as a game fix. Roughly ninety seconds of human attention.
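The fork put to the human can be written as two small policies. This is a sketch only: the function names and the dictionary shape are hypothetical, and the label text is taken from the decision quoted earlier.

```python
CAPITAL_LABEL = " (Capital City)"


def strict_with_label(name, capital_state, twin_states):
    """Chosen option: accept only the capital's state, and say so in the prompt."""
    return {"prompt": name + CAPITAL_LABEL, "accept": [capital_state]}


def multi_state_accept(name, capital_state, twin_states):
    """Rejected option: keep the bare prompt and accept every same-named city."""
    return {"prompt": name, "accept": [capital_state, *twin_states]}


print(strict_with_label("Augusta", "ME", ["GA"]))
# {'prompt': 'Augusta (Capital City)', 'accept': ['ME']}
print(multi_state_accept("Augusta", "ME", ["GA"]))
# {'prompt': 'Augusta', 'accept': ['ME', 'GA']}
```

Both options are technically valid, which is what makes the choice a taste call rather than a fact to look up.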

Everything before and after this slide was autonomous. The human was consulted once, on the one thing only they should own.

Phases 4 to 6 · On rails

Build, verify, and ship ran themselves.

Implement · Isolated & clean
Git worktree for parallel safety. Five source edits. antislop pass on every line players read (no em-dashes, US spelling).

Implement · Release hygiene
./zpx bump game, public + internal patch notes, manifest regeneration. None of it prompted by the human.

Verify · Proof before claims
Typecheck clean. 3,643 unit tests green. Prepublish gates (validate, patch notes, bundle build) all passed.

Ship · Safe landing
merge-lock critical section: lock → rebase → verify → push → release. No racing other agents, no red tree on main.
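The lock → rebase → verify → push → release sequence can be sketched as a critical section. This is an assumption-laden illustration, not the studio's landing-a-worktree-on-main skill: the lock is a plain exclusive file, and `rebase`, `verify`, and `push` are hypothetical callables standing in for the real git and test commands.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def merge_lock(path):
    # O_EXCL makes creation fail with FileExistsError if another agent holds the lock.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        # Release runs even when a step raises, so the lock is never left behind.
        os.close(fd)
        os.remove(path)


def land(lock_path, rebase, verify, push):
    with merge_lock(lock_path):
        rebase()
        if not verify():
            raise RuntimeError("verification failed; nothing was pushed")
        push()


log = []
lock_path = os.path.join(tempfile.mkdtemp(), "merge.lock")
land(
    lock_path,
    rebase=lambda: log.append("rebase"),
    verify=lambda: log.append("verify") or True,
    push=lambda: log.append("push"),
)
print(log)  # ['rebase', 'verify', 'push']
```

Verifying after the rebase and before the push is what keeps a red tree off main: a failed check raises inside the lock, so the push step is never reached.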
Ship · Right deploy path
Knew the push auto-publishes to dev, and prod needs a separate gated dispatch. Did not "push and hope."

Ship · Verified live
Smoke-checked prod: CDN bundle HTTP 200, feed catalog serving v1.2.3. Then cleaned up the worktree.
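A smoke check of this kind might look like the sketch below. Everything specific in it is assumed for illustration: the URL is a placeholder, the catalog is modeled as a simple mapping from game id to version, and the two fetchers are injected stubs so the example runs without network access.

```python
def smoke_check(fetch_status, fetch_catalog, bundle_url, game_id, expected_version):
    """Return a list of problems; an empty list means the deploy looks live."""
    problems = []
    status = fetch_status(bundle_url)
    if status != 200:
        problems.append(f"bundle returned HTTP {status}")
    served = fetch_catalog().get(game_id)
    if served != expected_version:
        problems.append(f"catalog serves {served}, expected {expected_version}")
    return problems


# Stub fetchers standing in for real HTTP calls (hypothetical values).
BUNDLE_URL = "https://cdn.example.invalid/state-tap/bundle.js"
ok = smoke_check(
    fetch_status=lambda url: 200,
    fetch_catalog=lambda: {"state-tap": "1.2.3"},
    bundle_url=BUNDLE_URL,
    game_id="state-tap",
    expected_version="1.2.3",
)
stale = smoke_check(
    fetch_status=lambda url: 200,
    fetch_catalog=lambda: {"state-tap": "1.2.2"},
    bundle_url=BUNDLE_URL,
    game_id="state-tap",
    expected_version="1.2.3",
)
print(ok)     # []
print(stale)  # ['catalog serves 1.2.2, expected 1.2.3']
```

Checking the served version as well as the HTTP status catches the case where the bundle is reachable but the old release is still being served.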
Why workflows beat raw prompting

Same model. Very different outcome.

Raw LLM, no scaffolding

  • Patches the two named bugs, misses 26.
  • Invents populations from memory.
  • Dumps 81k rows into context, or skips the data entirely.
  • Edits main directly, no tests.
  • Forgets version, manifest, patch notes.
  • "Push and hope" to whatever environment.

Zipix, skills + plugins + CLI

  • Audits all 28 against authoritative data.
  • Verifies every population from the Census file.
  • Keeps raw data in a sandbox; context stays lean.
  • Worktree + merge-lock + 3,643 tests.
  • Bump, notes, and manifests fire reflexively.
  • Dev auto, prod gated, both smoke-checked.
The instruments

What actually did the work.

Each one is a piece of studio knowledge, encoded once and reused on every task.

Memory · CLAUDE.md + gotcha index
A load-on-demand directory of how the studio works. Told it about version bumps, manifest regen, and the dev-vs-prod gate before it could trip on them.

Plugin · context-mode
Think-in-code sandbox. Let it audit an 81k-row dataset while spending almost no context window.

Skill · antislop
House-style guard on all player copy. No em-dashes, American spelling, no AI tells.

Skill · landing-a-worktree-on-main
The merge-lock protocol. A safe critical section so parallel agents never clobber main.

CLI · ./zpx
One entry point for bump, prepublish, and publish, with secrets injected. Turns release steps into single commands.

CI · Gated workflows
Publish Game with a dev/prod switch and a production environment gate. The studio's safety rails, not the model's.

The scoreboard

One bug report, fully closed out.

2 human messages of substance
28 findings surfaced from those 2
81,375 source rows audited
8 files changed (4 source, 2 generated, 2 notes)
3,643 tests green
6+ skills, plugins, CLIs engaged
1 session, no context lost
LIVE · v1.2.3 verified on prod

The thesis

A lean studio amplifies judgment, not effort.

The person noticed something was wrong and made two calls only a human should make. The studio's encoded workflows did the other ninety-five percent: the audit, the correctness, the tests, the release. That is what operational leverage looks like.

State Tap v1.2.3 · live on prod · 2 reported bugs → 28 fixed
~90 seconds of human decisions · one uninterrupted session