Z

ZIPIX

OPERATIONS CASE STUDY

One observation.
A shipped hotfix.

How a three-sentence bug report became a tested, documented, deployed release. The human was in the loop for one report and two decisions. The studio did the rest.

Game · State Tapv1.2.2 → v1.2.3commit 3cd3302f 2026-06-06live on prod

The amplification

Two human touches in, a whole release out.

The person did the irreplaceable part: noticing something was off, and making two taste calls. Encoded workflows turned that judgment into a correct, shipped fix.

28

bugs surfaced, from the 2 the human spotted

81,375

rows of ground-truth Census data audited

3,643

tests green before anything shipped

~90s

of human decisions, start to prod

The entire human contribution

Everything the person actually did.

Touch 1 · The report

"Augusta is also in Georgia (pop >200k), and Charleston is in South Carolina (>100k). Given these two gigantic misses, we need a thorough check, because I doubt the completeness of our list."

Touch 2 · The decision

"Keep capitals strict, so relabel it 'Augusta (Capital City).' Good find on the city adds. Make the change, land it on main, bump the version, get it on prod. It is a game fix, not a full release."

That is the whole input. No file paths, no commands, no test runs, no release steps. The eleven steps below ran on encoded studio knowledge.

The anatomy

From a sentence to a deploy, in six phases.

human seed

01

Diagnose

Read the data model, find the real root cause.

auto

02

Audit

Fetch Census ground truth, diff every clue.

decision gate

03

Decide

Present findings, take two taste calls.

auto

04

Implement

Worktree, edits, copy pass, version bump.

auto

05

Verify

Typecheck, 3,643 tests, prepublish gates.

auto

06

Ship

Lock, land, gated prod publish, smoke check.

Human in the loop (2 points) Autonomous, skill and tool driven

Phase 1 · Structural diagnosis

It read the architecture, not the symptom.

The shallow patch

Add Georgia to "Augusta."
Add South Carolina to "Charleston."
Call it done. Ship two array edits.
The other 26 collisions stay broken.

What actually happened

Found two answer regimes: city and capital.
The capital regime hardcodes acceptSet:[state].
Augusta and Charleston are capitals (ME, WV); their famous twins live elsewhere.
Unlabeled prompt + the "≥10k city" promise made it systemic, not a one-off.

Phase 2 · Ground truth, not memory

It refused to trust its own recall.

City populations are exactly where a raw model hallucinates. So instead of guessing, it pulled the authoritative source and computed the answer.

A

Fetched the US Census SUB-EST2023 file

The same source the dataset already cited. 81,375 incorporated-place rows, real populations, zero guesswork.

B

Processed it inside a sandbox (think-in-code)

All 81k rows stayed in the context-mode plugin. Only the 28-row diff entered the model's window. The audit cost almost no context.

C

Handled the messy real world

Normalized consolidated city-counties (Augusta-Richmond, Macon-Bibb) and caught the New England SUMLEV-061 town quirk the original audit had missed.

The payoff

Two reported bugs became twenty-eight found.

2

spotted by the human
(Augusta, Charleston)

⟶

28

surfaced by the audit
23 capital + 5 city

Human-reported

2

Capital collisions

23

City-list misses

5

The human named the two smallest fish. The audit caught the bigger ones: Columbus GA (202k), Springfield MO & MA (170k / 154k), Concord CA (122k).

Phase 3 · The decision gate

It asked only what it could not decide.

The audit produced one genuine judgment call: should a capital that shares a name with a bigger city accept both states, or stay strict? That is a taste question, not a fact. So it stopped and asked.

Presented to the human

A complete findings table, a recommended fix, and the one fork that needed a human: strict capitals with a clarifying label, or multi-state accept.

Decided in two lines

Strict capitals, add the "(Capital City)" label, take the five city adds, ship as a game fix. Roughly ninety seconds of human attention.

Everything before and after this slide was autonomous. The human was consulted once, on the one thing only they should own.

Phases 4 to 6 · On rails

Build, verify, and ship ran themselves.

Implement

Isolated & clean

Git worktree for parallel safety. Five source edits. antislop pass on every line players read (no em-dashes, US spelling).

Implement

Release hygiene

./zpx bump game, public + internal patch notes, manifest regeneration. None of it prompted by the human.

Verify

Proof before claims

Typecheck clean. 3,643 unit tests green. Prepublish gates (validate, patch notes, bundle build) all passed.

Ship

Safe landing

merge-lock critical section: lock → rebase → verify → push → release. No racing other agents, no red tree on main.

Ship

Right deploy path

Knew the push auto-publishes to dev, and prod needs a separate gated dispatch. Did not "push and hope."

Ship

Verified live

Smoke-checked prod: CDN bundle HTTP 200, feed catalog serving v1.2.3. Then cleaned up the worktree.

Why workflows beat raw prompting

Same model. Very different outcome.

Raw LLM, no scaffolding

Patches the two named bugs, misses 26.
Invents populations from memory.
Dumps 81k rows into context, or skips the data entirely.
Edits main directly, no tests.
Forgets version, manifest, patch notes.
"Push and hope" to whatever environment.

Zipix, skills + plugins + CLI

Audits all 28 against authoritative data.
Verifies every population from the Census file.
Keeps raw data in a sandbox; context stays lean.
Worktree + merge-lock + 3,643 tests.
Bump, notes, and manifests fire reflexively.
Dev auto, prod gated, both smoke-checked.

The instruments

What actually did the work.

Each one is a piece of studio knowledge, encoded once and reused on every task.

Memory

CLAUDE.md + gotcha index

A load-on-demand directory of how the studio works. Told it about version bumps, manifest regen, and the dev-vs-prod gate before it could trip on them.

Plugin

context-mode

Think-in-code sandbox. Let it audit an 81k-row dataset while spending almost no context window.

Skill

antislop

House-style guard on all player copy. No em-dashes, American spelling, no AI tells.

Skill

landing-a-worktree-on-main

The merge-lock protocol. A safe critical section so parallel agents never clobber main.

CLI

./zpx

One entry point for bump, prepublish, and publish, with secrets injected. Turns release steps into single commands.

CI

Gated workflows

Publish Game with a dev/prod switch and a production environment gate. The studio's safety rails, not the model's.

The scoreboard

One bug report, fully closed out.

2

human messages of substance

28

findings surfaced from those 2

81,375

source rows audited

8

files changed (4 source, 2 generated, 2 notes)

3,643

tests green

6+

skills, plugins, CLIs engaged

1

session, no context lost

LIVE

v1.2.3 verified on prod

The thesis

A lean studio amplifies
judgment, not effort.

The person noticed something was wrong and made two calls only a human should make. The studio's encoded workflows did the other ninety-five percent: the audit, the correctness, the tests, the release. That is what operational leverage looks like.

State Tap v1.2.3 · live on prod · 2 reported bugs → 28 fixed
~90 seconds of human decisions · one uninterrupted session

One observation.A shipped hotfix.

Two human touches in, a whole release out.

Everything the person actually did.

From a sentence to a deploy, in six phases.

It read the architecture, not the symptom.

The shallow patch

What actually happened

It refused to trust its own recall.

Fetched the US Census SUB-EST2023 file

Processed it inside a sandbox (think-in-code)

Handled the messy real world

Two reported bugs became twenty-eight found.

It asked only what it could not decide.

Build, verify, and ship ran themselves.

Same model. Very different outcome.

Raw LLM, no scaffolding

Zipix, skills + plugins + CLI

What actually did the work.

One bug report, fully closed out.

A lean studio amplifiesjudgment, not effort.

One observation.
A shipped hotfix.

A lean studio amplifies
judgment, not effort.