Why We Killed the Reviewer Gate (and What We Use Instead)

ClaudeKit v2 dropped blocking reviewer/quality-gate agents. Commands now end with EVIDENCE — a diff, report, or verified file — not a gated loop. Here's why.

ClaudeKit v1 ended every command with a blocking reviewer agent that scored output 0-100 and looped up to three times. We were proud of it. Then we killed it entirely for v2. The evidence-based approach — where every command ends with a concrete artifact (diff, report, verified file) instead of a reviewer gate — ships faster, costs less, and catches more real problems than a self-referential scoring loop ever did. Here is exactly what we built, why the gate failed, and what replaced it.

What was the reviewer gate and why did we build it?

The v1 gate was a genuine attempt to solve a real problem: LLM output is inconsistent. The same prompt yields a sharp draft on Tuesday and a generic one on Wednesday. There is no floor. You cannot build a repeatable workflow on output whose quality is a coin flip.

So we built a floor. Every deliverable-producing command in v1 ended with a dedicated reviewer agent — reviewer in what was then called FounderKit, quality-gate in MarketingKit, content-reviewer in SEOKit, cro-auditor in EcomKit. Each reviewer scored the output against a decomposed, weighted rubric (substance, structure, voice/fit, clarity, mechanics). Score below 90, the reviewer returned a line-level fix list. The producer agent revised. The gate re-ran. Maximum three iterations, then escalation.
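The loop described above can be sketched roughly as follows. This is an illustrative reconstruction, not ClaudeKit's v1 source: `run_gate`, `GateResult`, and the `produce`, `review`, and `revise` callables are hypothetical names. Only the threshold of 90 and the three-pass cap come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

PASS_THRESHOLD = 90   # from the post: scores below 90 trigger a revision
MAX_ITERATIONS = 3    # from the post: at most three passes, then escalation


@dataclass
class GateResult:
    draft: str
    score: int
    iterations: int
    escalated: bool


def run_gate(
    produce: Callable[[], str],
    review: Callable[[str], Tuple[int, List[str]]],
    revise: Callable[[str, List[str]], str],
) -> GateResult:
    """Hypothetical v1-style blocking gate: score, revise, re-score."""
    draft = produce()
    score = 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        score, fixes = review(draft)          # reviewer returns a score and a fix list
        if score >= PASS_THRESHOLD:
            return GateResult(draft, score, iteration, escalated=False)
        if iteration < MAX_ITERATIONS:
            draft = revise(draft, fixes)      # targeted revision, not a rewrite
    # Cap reached without clearing the threshold: hand the problem to the user.
    return GateResult(draft, score, MAX_ITERATIONS, escalated=True)
```

Note that even in this simplified form, every run pays for at least one review pass, including runs where the first draft already clears the threshold.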

On paper this is elegant. The author does not grade its own work. Targeted revision converges faster than a rewrite. The cap bounds runaway token cost. And for a while, in controlled conditions, it worked.

The failure modes only showed up at scale.

Why did the gate fail in practice?

Three problems compounded each other.

The reviewer graded what it could measure, not what mattered. A rubric that scores "substance / claims" can check whether a number is labeled as an estimate. It cannot check whether the number is wrong. It can penalize "weak CTA presence" but not whether the CTA makes sense for the buyer stage. The gate was very good at catching craft gaps and very bad at catching epistemic gaps — which are the ones that actually sink a deliverable.

Iteration loops hid input problems. When a draft could not clear 90 after three passes, it almost always meant the inputs were wrong, not the writing. Missing willingness-to-pay evidence, no proof points for a differentiation claim, context files that did not match the actual product. But before escalating, the loop had already burned tokens on three increasingly sophisticated attempts to polish something that was fundamentally misaimed. Users received a polished-but-hollow artifact more often than a useful escalation.

The reviewer became a cost center, not a value center. Running an Opus-tier reviewer agent on top of an already-heavy command context added 30-60% to the token cost of every deliverable command, every time. When we measured this (we use a tiktoken-compatible counter, roughly 4 chars/token), the reviewer was the single largest discretionary cost in most kit workflows. And because the reviewer ran even when the first draft was already good — which was the common case — most of that cost was pure overhead.

Here is a concrete look at what that overhead looked like across v1 kit workflows:

| Kit (v1) | Avg tokens without gate | Avg tokens with gate | Gate overhead |
|---|---|---|---|
| FounderKit /raise | ~18,400 | ~26,100 | +42% |
| MarketingKit /campaign | ~12,800 | ~17,900 | +40% |
| SEOKit /content | ~11,200 | ~15,600 | +39% |
| EcomKit /product-page | ~10,600 | ~14,900 | +41% |

The gate was adding roughly 40% to token cost while catching problems the user could have caught in 10 seconds by reading the output.
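The overhead column is plain arithmetic on the two token averages. A short check, using only the figures from the table (the dictionary and function names are illustrative):

```python
# Average token counts copied from the table above: (without gate, with gate).
V1_WORKFLOWS = {
    "FounderKit /raise": (18_400, 26_100),
    "MarketingKit /campaign": (12_800, 17_900),
    "SEOKit /content": (11_200, 15_600),
    "EcomKit /product-page": (10_600, 14_900),
}


def gate_overhead_pct(without_gate: int, with_gate: int) -> int:
    """Extra tokens the gate adds, as a rounded percentage of the ungated run."""
    return round((with_gate - without_gate) / without_gate * 100)


overheads = {name: gate_overhead_pct(*pair) for name, pair in V1_WORKFLOWS.items()}
for name, pct in overheads.items():
    print(f"{name}: +{pct}%")
```

The percentages are measured against the ungated run. Measured against the gated total instead, a 40% overhead is about 29% of the tokens spent, since 0.4 / 1.4 ≈ 0.29.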

What does v2 use instead?

The v2 architecture is built on one principle: commands end with EVIDENCE, not a reviewer gate.

Every command in ClaudeKit v2 produces a concrete artifact that speaks for itself. Not a score. Not a gate verdict. An actual thing you can look at and evaluate without any intermediate agent's opinion:

  1. A unified diff showing exactly what changed and why
  2. A structured report with named findings, severity levels, and line references
  3. A verified file with explicit pass/fail checks run and logged inline
  4. A populated data table with sources cited per row

When /eng debug finishes, you get a root-cause diagnosis with the exact code change and a reproduction path — not a score of 94 from a reviewer agent. When /seo quick-wins finishes, you get a ranked table of positions 8-20 with low-CTR flags and the specific copy changes to test — not a "content quality" verdict. When /mkt voice finishes, you get a voice file built from your actual posts with 14 AI-tell patterns identified and stripped — not a rubric score.

The evidence is the quality signal. If the diff is wrong, you see it. If the report missed something obvious, you see it. A reviewer agent's job is to make a judgment call that you were going to make anyway — so we removed the middleman.
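As an illustration of the first artifact type, Python's standard `difflib` module can turn a before/after pair into a unified diff. This is a generic sketch, not how ClaudeKit commands generate their diffs; the file name and file contents are made up for the example.

```python
import difflib

# Hypothetical before/after contents of a small file.
before = ["def total(items):\n", "    return sum(items) / len(items)\n"]
after = ["def total(items):\n", "    return sum(items)\n"]

# unified_diff yields header lines, a hunk marker, then context/-/+ lines.
diff_lines = list(
    difflib.unified_diff(before, after, fromfile="a/cart.py", tofile="b/cart.py")
)
print("".join(diff_lines))
```

The point of the format is that a wrong change is visible on the `-` and `+` lines themselves, with no score standing between the reader and the change.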

How do the v2 read-only agents work differently?

V2 still has agents: 13 of them across the 5 kits. What changed is the architecture. They are all read-only specialists (reviewer, auditor, and researcher roles) that observe and report rather than gatekeep and loop.

Here is the distinction:

| V1 reviewer gate | V2 read-only agent |
|---|---|
| Blocks the command pipeline | Runs in parallel or after, non-blocking |
| Scores output and triggers revision loops | Observes and annotates findings |
| Determines whether work ships | You determine whether work ships |
| Adds 40% token overhead to every run | Invoked explicitly when you want a second read |
| Hides input problems behind polish iterations | Surfaces input problems immediately |

An example: /seo audit in v2 produces a structured findings report. A read-only SEO specialist agent can be invoked separately to add a second perspective on keyword intent alignment. That invocation is explicit, optional, and non-blocking. You run it when you want it. The primary command does not wait for it.

This matters for the engineering kit too. /eng review produces a diff with inline annotations. A read-only code reviewer agent can add a second pass on security-adjacent changes. The command ships its evidence either way. The agent adds signal; it does not control the gate.
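One way to picture the separation is as two independent calls: the command returns its evidence on its own, and a second read is a separate step that can only add notes. The sketch below is a hypothetical model of that shape, not ClaudeKit's implementation; `Evidence`, `run_command`, and `second_read` are invented names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Evidence:
    artifact: str                      # e.g. a diff or a findings report
    annotations: Tuple[str, ...] = ()  # notes from optional second reads


def run_command(produce: Callable[[], str]) -> Evidence:
    """Primary command: returns its evidence without waiting on any agent."""
    return Evidence(artifact=produce())


def second_read(evidence: Evidence, agent: Callable[[str], Iterable[str]]) -> Evidence:
    """Explicit, optional pass: the agent reads the artifact and adds notes only."""
    notes = tuple(agent(evidence.artifact))
    return Evidence(evidence.artifact, evidence.annotations + notes)
```

Because `Evidence` is frozen and `second_read` copies the artifact through unchanged, the agent in this model has no way to rewrite the work or to stop it from shipping.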

What does this mean for the token budget?

The v2 kits ship with measured token counts, not estimates. We run ck tokens <kit> after every build and publish the results. Current totals:

  • EngineerKit: 20,413 tokens (25 commands, 4 skills, 4 agents)
  • MarketingKit: 16,714 tokens (20 commands, 3 skills, 2 agents)
  • VideoKit: 12,602 tokens (17 commands, 5 skills, 3 agents)
  • SEOKit: 16,004 tokens (19 commands, 4 skills, 2 agents)
  • EcomKit: 16,464 tokens (20 commands, 3 skills, 2 agents)

Total across all 5 kits: 101 commands, 19 skills, 13 agents, 82,197 measured tokens.
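The totals can be re-derived from the per-kit figures. The snippet below is only a cross-check on the published numbers, not the `ck tokens` implementation; the `LEDGER` name is illustrative.

```python
# Per-kit figures copied from the list above: (tokens, commands, skills, agents).
LEDGER = {
    "EngineerKit": (20_413, 25, 4, 4),
    "MarketingKit": (16_714, 20, 3, 2),
    "VideoKit": (12_602, 17, 5, 3),
    "SEOKit": (16_004, 19, 4, 2),
    "EcomKit": (16_464, 20, 3, 2),
}

# zip(*rows) transposes the rows into columns so each column can be summed.
tokens, commands, skills, agents = (sum(col) for col in zip(*LEDGER.values()))
print(f"{commands} commands, {skills} skills, {agents} agents, {tokens:,} tokens")
# prints: 101 commands, 19 skills, 13 agents, 82,197 tokens
```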

That token count is the whole product installed. No reviewer gate overhead on top of it. The ledger prints every time you run ck install <kit> so you know exactly what you are loading into context. For a deeper look at how we measure context cost and why token hygiene matters more than most people realize, the context cost measurement post has the methodology.

Is evidence-based output actually better quality?

This is the fair challenge to the v2 approach: the reviewer gate at least checked something. Evidence-based output just trusts the producing agent. How is that better?

Three reasons.

First, the evidence format forces specificity that a score cannot. A diff either shows the right change or it does not. A findings report either cites the right line or it does not. There is no "87 out of 100" ambiguity that lets a bad draft hide behind a near-miss score.

Second, the commands are built to produce evidence from the start, rather than to draft output and then check it. /eng debug is structured as a root-cause investigation, not as "write a fix and grade it." The investigation structure is the quality mechanism. The evidence is the output of the investigation, not a post-hoc check on a draft.

Third, we calibrate against real baselines. /ecom no-sales compares store metrics against AOV-band benchmarks we maintain. /seo quick-wins uses actual position and CTR data from Search Console. /mkt humanize runs against a fixed list of 14 identified AI-tell patterns. These are not rubric scores from a separate agent; they are structured comparisons against named references. A comparison against a benchmark is harder to game than a rubric self-score.
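To make "structured comparison" concrete, here is a minimal sketch of the kind of filter a quick-wins pass implies: keep queries ranked in positions 8-20 and flag the ones with low CTR. It is an assumption-laden illustration, not the /seo quick-wins logic. `QueryRow`, `quick_wins`, the 2% cutoff, and the sort order are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryRow:
    query: str
    position: float   # average ranking position
    ctr: float        # click-through rate as a fraction, e.g. 0.012 = 1.2%


LOW_CTR = 0.02  # hypothetical cutoff, not a ClaudeKit value


def quick_wins(rows: List[QueryRow]) -> List[QueryRow]:
    """Keep queries ranked 8-20 with CTR under the cutoff, best position first."""
    hits = [r for r in rows if 8 <= r.position <= 20 and r.ctr < LOW_CTR]
    return sorted(hits, key=lambda r: r.position)
```

Every row in the output can be checked against the source data by hand, which is the property a rubric score lacks.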

The result: commands in v2 catch more real problems than the v1 gate did, without the gate's roughly 40% token overhead (about 29% fewer tokens than a gated run, since 0.4 / 1.4 ≈ 0.29) and without the false confidence of a score that could not verify its own inputs.

How does this change how you use ClaudeKit?

If you were using v1 kits, the workflow change is minimal. Commands still end with a deliverable. The deliverable is now more explicit about what was checked and how. You read it the same way you would read a diff or a report, which you were probably doing anyway.

If you are building your own Claude Code workflows, the architecture principle is portable:

  1. Design commands to produce evidence, not drafts to be graded. Root-cause investigation instead of write-then-check. Structured comparison against a named benchmark instead of a rubric score.
  2. Make read-only agents explicit and optional. If you want a second perspective, invoke a specialist agent. Do not wire it into the pipeline as a blocking gate.
  3. Publish your token costs. Run ck tokens or your own counter after every build. Hidden token costs are how workflows become expensive in ways users do not notice until the bill arrives.
  4. Escalate early instead of iterating. When a command cannot produce useful evidence because the inputs are missing, it should say so immediately, not after three polish passes. Missing inputs are a signal to the user, not a problem to iterate around.
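The last principle in the list above is the easiest to express in code: check inputs before doing any work, and fail with the blocking reason. A minimal sketch, assuming inputs arrive as a name-to-value mapping; `MissingInputs` and `require_inputs` are hypothetical names, not part of ClaudeKit.

```python
from typing import Dict, List, Optional


class MissingInputs(Exception):
    """Raised before any work starts when required inputs are absent."""


def require_inputs(inputs: Dict[str, Optional[str]], required: List[str]) -> None:
    """Fail fast with the list of missing or empty inputs, in the order required."""
    missing = [name for name in required if not inputs.get(name)]
    if missing:
        raise MissingInputs(f"cannot produce evidence, missing: {', '.join(missing)}")
```

A command built this way would call `require_inputs` as its first step, so the user sees the blocking reason before any tokens are spent on drafting.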

For a deeper look at how agents, skills, and slash commands interact in the v2 architecture, see the agents vs skills vs slash commands breakdown. The model-tiering question — which commands run Haiku vs Sonnet vs Opus — is covered in the model-tiering guide.

What about commands where a second review genuinely helps?

Some commands benefit from a second read. We kept agents for exactly those cases, just without the blocking architecture.

/eng review optionally invokes a read-only code reviewer for security-adjacent diffs. /seo audit can invoke an intent-alignment specialist for ambiguous keyword clusters. /mkt voice can invoke a brand-consistency auditor against your voice file. These are all explicit invocations, not automatic gates. You ask for the second read when the stakes justify it.

The difference is who decides whether the work is good enough: you. Not a rubric agent running against a score you did not design and cannot audit. The reviewer gate was an attempt to remove human judgment from the quality loop. V2 is built on the opposite premise: the command produces the best evidence it can, and you exercise your judgment on that evidence. That is not a step backward. It is an honest division of labor between what an agent is good at (structured investigation, benchmark comparison, evidence production) and what you are good at (deciding whether this particular artifact serves this particular purpose).


If you want to see the evidence-based approach in practice, the EngineerKit is the most explicit demonstration of it — every command ends with a diff, a verified test run, or a structured report, and the 4 read-only agents are opt-in, not pipeline gates. The MarketingKit shows the same architecture applied to content: /mkt humanize ends with a before/after comparison against 14 named patterns, not a quality score. Start with whichever kit matches your stack at the pricing page.

FAQ

Why did the original reviewer gate seem to work at first?

In controlled conditions with well-specified inputs, the gate did raise the floor. When context files were complete and the deliverable type was well-defined, the 0-100 rubric caught real gaps and the revision loop converged in one or two passes. The failure modes only became visible at scale, with incomplete inputs and varied deliverable types, where the gate's blind spots (epistemic gaps, input problems disguised as craft problems) compounded faster than its strengths.

Does removing the gate mean quality is lower in v2?

No. Quality is higher because the commands are designed from the ground up to produce evidence rather than drafts to be graded. A root-cause investigation that ends with a diff and a reproduction path is higher quality than a draft-then-check loop. The evidence format is harder to fake than a rubric score, and the structured comparisons against named benchmarks catch real problems that a self-referential scoring agent missed entirely.

What happens when a v2 command encounters missing inputs?

Commands escalate immediately when critical inputs are absent, rather than iterating and polishing around the gap. /seo quick-wins will tell you it needs Search Console access before running the position analysis. /mkt voice will ask for sample posts before building the voice file. Immediate escalation is the honest design: you get the blocking reason right away, not after three token-heavy polish passes that could not fix the underlying gap.

How do the 13 read-only agents in v2 differ from the v1 reviewer agents?

V1 reviewer agents were wired into the command pipeline as blocking gates — the command could not emit output until the reviewer passed it. V2 agents are read-only specialists invoked explicitly, outside the primary command pipeline. They observe and annotate; they do not control whether work ships. You invoke them when you want a second perspective on high-stakes output. They add signal without adding pipeline friction or guaranteed token overhead.

Can I still get a quality score if I want one?

Yes. You can invoke a read-only specialist agent on any command output and ask it to produce a structured quality assessment. The difference is that this is an explicit, optional step you take when the stakes justify it, not an automatic gate that runs on every command regardless of stakes. For routine output you review the evidence directly. For high-stakes deliverables you add the specialist pass on top. You decide which commands warrant it.

How do I see the token cost before installing a kit?

Run ck tokens <kit> before ck install <kit>. This prints the full token ledger for every command, skill, and agent in the kit so you know exactly what will load into context. The install command also prints the ledger as part of its output. All five kit token counts are also published on the individual kit pages — EngineerKit, MarketingKit, VideoKit, SEOKit, EcomKit — so you can compare before committing.
