Six Months Building ClaudeKit in Public: What We Shipped, What We Killed, and the Numbers Behind v2

Six months ago we shipped v1 of ClaudeKit with six kits, 461 so-called skills, 88 agents, and a lot of architectural decisions we were confident about. We were wrong on most of them. v2 ships with 5 kits, 101 commands, 19 skills, 13 read-only agents, and 82,197 measured tokens — roughly one-fifth the bloat, every number summed from source files, nothing invented for a marketing page. Here is what changed, why, and what the honest numbers look like.

What exactly shipped in v2?

Five vertical kits for Claude Code, each installed in under 60 seconds via ck install <kit> or the plugin marketplace:

Kit	Commands	Skills	Agents	Tokens
EngineerKit `/eng`	25	4	4	20,413
MarketingKit `/mkt`	20	3	2	16,714
SEOKit `/seo`	19	4	2	16,004
EcomKit `/ecom`	20	3	2	16,464
VideoKit `/video`	17	5	3	12,602
Total	101	19	13	82,197

These figures are summed from the kit manifests, not approximated. They are the same numbers the CLI prints when you run ck tokens <kit>. Each token count is the most a kit can load into Claude Code's context window: a ceiling on the tax you pay per conversation, with individual commands loading a fraction of it.
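The totals row can be checked by hand or with a few lines of code. A minimal sketch, using the per-kit figures from the table above; the dictionary layout and the `totals` helper are illustrative and are not the ClaudeKit manifest format:

```python
# Per-kit figures copied from the table above: (commands, skills, agents, tokens).
# The dict layout is an assumption for illustration, not the real manifest schema.
KITS = {
    "engineer": (25, 4, 4, 20_413),
    "marketing": (20, 3, 2, 16_714),
    "seo": (19, 4, 2, 16_004),
    "ecom": (20, 3, 2, 16_464),
    "video": (17, 5, 3, 12_602),
}


def totals(kits):
    """Sum each column (commands, skills, agents, tokens) across all kits."""
    columns = zip(*kits.values())
    return tuple(sum(column) for column in columns)


commands, skills, agents, tokens = totals(KITS)
print(commands, skills, agents, tokens)  # 101 19 13 82197
```

The printed values match the Total row, which is the property the manifest-derived ledger is meant to guarantee.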

Why did we kill FounderKit and SalesKit?

v1 shipped with FounderKit and SalesKit as two of the six verticals. We shelved both before v2 went out.

The honest reason: they tried to do too much with too little specificity. A "founder" is not a coherent operator persona at the workflow level — they are sometimes a marketer, sometimes an engineer, sometimes running ecommerce, always context-switching. The commands we built for FounderKit (/founder raise, /founder pitch) produced outputs that were generically competent and specifically useless. When we tested them against real founders, the feedback was consistent: "this reads like GPT-4 wrote it at 2am." SalesKit had the same problem in a different direction — sales workflows need deep CRM integration that a slash-command kit cannot substitute for.

Killing them was the right call. The five remaining verticals each map to a coherent, repeatable daily workflow. EngineerKit's /eng debug runs root-cause-first, not symptom-first. EcomKit's /ecom no-sales runs a triage against AOV-band benchmarks. These are workflows an operator runs every week, not a one-time pitch deck generator.

How did we go from 461 skills to 19?

This is the question we get asked most. v1's 461-skill count was real (the skills existed in the repos), but it measured the wrong thing. Most of those "skills" were prompt fragments of 200 to 800 tokens that we auto-loaded into every session whether the command needed them or not. We measured the actual token overhead in production and found that a typical v1 command was loading 35,000-60,000 tokens of context before the user typed a single character.

We rebuilt from a different question: what does this command actually need to run well? The answer, almost every time, was one focused knowledge file (now called a skill) that loads only when the relevant command runs. Not a library of 80 fragments hoping the right one fires.
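The difference between the two loading strategies can be sketched in a few lines. This is an illustration only: the skill names, token sizes, and command-to-skill mapping below are hypothetical placeholders, not ClaudeKit's real files or figures.

```python
# Hypothetical skill sizes in tokens; placeholders, not measured ClaudeKit values.
SKILL_TOKENS = {
    "debugging": 900,
    "testing": 700,
    "git-hygiene": 400,
}

# Hypothetical mapping from a command to the one focused skill it needs.
COMMAND_SKILLS = {
    "/eng debug": ["debugging"],
    "/eng tdd": ["testing"],
    "/eng commit": ["git-hygiene"],
}


def eager_cost(skill_tokens):
    """v1-style loading: every fragment enters context in every session."""
    return sum(skill_tokens.values())


def on_demand_cost(command, command_skills, skill_tokens):
    """v2-style loading: only skills attached to the invoked command are counted."""
    return sum(skill_tokens[name] for name in command_skills.get(command, []))


print(eager_cost(SKILL_TOKENS))                                    # 2000
print(on_demand_cost("/eng debug", COMMAND_SKILLS, SKILL_TOKENS))  # 900
```

With eager loading the cost grows with the size of the library; with on-demand loading it grows only with what the invoked command needs.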

The result: v2 commands run at a fraction of the context cost. We documented the methodology in The Real Cost of Free Skills, and we hold ourselves to the same discipline: our numbers are either measured or labeled as estimates, never marketing assertions.

What changed about the agent architecture?

v1 used orchestrator agents and reviewer/quality-gate agents. The pattern looked like this: a command would run, then hand off to a reviewer agent, which would either approve or send the work back. It felt rigorous. In practice it was theatrical. The reviewer agents had no ground truth to review against — they were just running another LLM pass that added latency and tokens without adding correctness.

v2 uses 13 read-only specialist agents across the five kits. These are researcher, auditor, and reviewer roles that:

Pull from real data sources (DataForSEO, Firecrawl, GSC, Ahrefs for SEOKit; Remotion render outputs for VideoKit)
Produce an EVIDENCE artifact — a report, diff, or verified file — not a pass/fail gate
Run once, not in a loop

The key distinction is that commands in v2 end with evidence, not approval. /eng verify produces a test run report. /seo audit produces a scored findings file. /video clone produces a verified render. The output is the accountability mechanism, not a gatekeeper agent deciding whether the output is good enough.
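One way to picture "evidence, not approval" is as a return type with no verdict field. A minimal sketch under that assumption; the `Evidence` class, its fields, the `run_once` helper, and the artifact path are all hypothetical and do not describe ClaudeKit's internals:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    """What a command hands back: an artifact to inspect, not a pass/fail verdict."""
    command: str
    artifact_path: str  # hypothetical location of the report, diff, or render
    findings: List[str] = field(default_factory=list)


def run_once(command: str, artifact_path: str, findings: List[str]) -> Evidence:
    """Single pass: package the findings and return them. No approve/reject loop."""
    return Evidence(command=command, artifact_path=artifact_path, findings=list(findings))


# Placeholder findings for illustration only.
report = run_once("/eng verify", "reports/test-run.txt", ["finding A", "finding B"])
print(report.command, len(report.findings))  # /eng verify 2
```

Because the type has no `approved` flag, nothing downstream can block on it; the reader of the artifact makes the call.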

No orchestrator agents. No blocking reviewer gates. No runnable Python tools. No --demo flags. No examples/ folders. Those were all v1 patterns we cut.

Does the pricing math still hold at these numbers?

Yes, and the model got simpler:

Tier	Monthly	Annual	One-Time
Single kit	$14.99/mo	$119/yr	$99 lifetime (as shipped)
Pro (any 3, swap 1/cycle)	$29.99/mo	$239/yr	—
All-Access (all 5 kits)	$49.99/mo	$399/yr	—

The 14-day refund window (not 30 days; we corrected this in the v2 docs) applies to all tiers. Each license covers 3 devices. Lifetime is priced per kit as shipped: it does not include future kits or updates.

The competitive math is the same as in v1, but now it is stated honestly: the $99 lifetime price for a single kit matches what the main competitor charges for one gated kit, but you own the install permanently and you can see the token cost before you spend a single token on it. Full tier breakdown is on the pricing page.
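The tier arithmetic is easy to check. A short sketch using only the prices in the table above; the helper names are illustrative:

```python
import math

# Prices from the pricing table above: (monthly, annual) in USD.
TIERS = {
    "single": (14.99, 119.00),
    "pro": (29.99, 239.00),
    "all-access": (49.99, 399.00),
}
LIFETIME_SINGLE = 99.00


def annual_saving(monthly, annual):
    """Dollars saved per year by paying annually instead of twelve monthly payments."""
    return round(monthly * 12 - annual, 2)


def lifetime_break_even_months(monthly, lifetime):
    """First month at which cumulative monthly payments meet or exceed the lifetime price."""
    return math.ceil(lifetime / monthly)


for name, (monthly, annual) in TIERS.items():
    print(name, annual_saving(monthly, annual))
print(lifetime_break_even_months(14.99, LIFETIME_SINGLE))  # 7
```

Annual billing saves about a third on every tier, and the single-kit lifetime price overtakes monthly billing in month seven, keeping in mind that lifetime covers the kit as shipped and not future updates.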

How does install actually work now?

Two paths, both under 60 seconds:

CLI path:

npm install -g claudekits   # v0.1.3
ck auth <your-key>
ck install engineer         # or marketing, seo, ecom, video

Install defaults to global (~/.claude). Pass --local to scope it to a project. The token ledger prints on every install. ck tokens <kit> recounts; ck doctor diagnoses path and config issues; ck list shows your entitlements.

Plugin marketplace path:

/plugin marketplace add Madni-Aghadi/claudekit-engineer

Same manifest, same token count, same commands — just installed directly from inside Claude Code without touching the terminal. This path landed in v2 after the Agent Skills open standard was adopted in December 2025. The standard has since been adopted by 32+ tools, and the skills ecosystem grew 18.5x in 20 days after launch — external validation that the install surface we built toward was the right one.

What are the flagship commands in each kit?

Each kit has one or two commands we consider the reason to install it. These are the daily-driver workflows, not demos:

EngineerKit daily 8: /eng catchup, /eng plan, /eng tdd, /eng debug, /eng verify, /eng review, /eng commit, /eng handoff. Plus /eng ship and /eng fix-issue for the full sprint loop. The flagship is /eng debug — root-cause-first triage, not symptom chase.

MarketingKit flagships: /mkt voice builds a voice file from your actual published posts (not a persona template). /mkt humanize strips 14 measurable AI tells from copy. Also: /mkt hooks, /mkt repurpose (one piece to five formats), /mkt thread, /mkt newsletter, /mkt calendar, /mkt launch.

SEOKit flagships: /seo quick-wins targets positions 8-20 plus low-CTR pages — the easiest ranking gains most audits ignore. /seo citations runs N-pass AI citation measurement with confidence intervals. Also: /seo audit, /seo write, /seo check, /seo pseo, /seo extractable.
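For readers unfamiliar with putting a confidence interval on a repeated measurement, here is a minimal sketch of one common approach, the Wilson score interval for a proportion. Which interval /seo citations actually uses is not stated here, so treat the method, the function name, and the sample numbers as assumptions:

```python
import math


def wilson_interval(cited: int, passes: int, z: float = 1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if passes <= 0:
        raise ValueError("passes must be positive")
    if not 0 <= cited <= passes:
        raise ValueError("cited must be between 0 and passes")
    p = cited / passes
    denom = 1 + z * z / passes
    center = (p + z * z / (2 * passes)) / denom
    half = z * math.sqrt(p * (1 - p) / passes + z * z / (4 * passes * passes)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative input: a page cited in 7 of 10 passes.
low, high = wilson_interval(cited=7, passes=10)
print(f"observed rate 0.70, interval {low:.2f} to {high:.2f}")
```

The interval is wide at ten passes, which is the point of reporting it: a single pass, or a bare percentage, overstates how much is known.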

EcomKit flagship: /ecom no-sales runs a full store triage against AOV-band benchmarks — it answers why you are not selling, not just what your conversion rate is. Also: /ecom flows, /ecom cart-recovery, /ecom amazon, /ecom margin, /ecom ads, /ecom reviews, /ecom bfcm.

VideoKit flagship: /video clone recreates a reference video's style in Remotion and verifies the match against the original. Also: /video make, /video demo, /video caption, /video data, /video social.

What does "build in public" actually obligate us to?

The phrase is easy to hollow out into a content aesthetic. For us it has a structural meaning tied to the manifest architecture: the site, CLI, and token ledger all derive from the same source files. We cannot inflate a command's capability on a marketing page while the CLI installs something different — there is one source of truth. When the site says EngineerKit loads 20,413 tokens, that is what ck tokens engineer returns. The constraint is the accountability mechanism.

It also sets a standard for everything we publish. When we say a command runs "in roughly 3 minutes on a mid-tier Sonnet call," that is a labeled planning estimate from logged runs, not a benchmark we are guaranteeing. We separate three categories strictly: measured fact (token counts, command counts, prices), planning estimate (workflow times, token spend per run), and roadmap (not yet shipped). "In public" means being explicit about which category each number falls into.
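The three-category rule can be pictured as a tag that travels with every published number. A minimal sketch; the `Category` enum and `Claim` class are hypothetical and only illustrate the labeling discipline described above:

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    MEASURED = "measured fact"
    ESTIMATE = "planning estimate"
    ROADMAP = "roadmap"


@dataclass(frozen=True)
class Claim:
    text: str
    category: Category

    def label(self) -> str:
        """Render the claim with its category so a reader can tell which kind it is."""
        return f"[{self.category.value}] {self.text}"


claims = [
    Claim("EngineerKit loads 20,413 tokens", Category.MEASURED),
    Claim("a command runs in roughly 3 minutes", Category.ESTIMATE),
    Claim("team licenses", Category.ROADMAP),
]
for claim in claims:
    print(claim.label())
```

Making the category a required field means a number cannot be published without someone deciding which kind of number it is.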

This discipline matters more now that "Claude Code specialist" job demand is up 938% since June 2026 and Claude Code accounts for roughly $2.5B of Anthropic's revenue. The market is real and getting crowded. The operators entering it deserve tooling that is honest about what it costs and what it does.

What is on the roadmap for the rest of 2026?

Three priorities, in order:

MCP data integration across all kits. SEOKit agents already pull from DataForSEO, Firecrawl, Ahrefs, and Google Search Console. The same real-data pattern needs to extend to EcomKit (Shopify, Klaviyo) and VideoKit (YouTube Analytics). Commands that reason from real data produce better output than commands that reason from memory.
Team licenses. All current tiers are individual. Shared kit access with seat management is the most-requested direction from small agencies and dev teams. The manifests make this straightforward technically; pricing and device-count rules are the part we are still working through.
Measured Lighthouse benchmarks. The site is built to a Core Web Vitals budget. We will publish measured scores, not asserted performance. LCP target under 1.8s, CLS under 0.05.

None of these are shipped today. They are the stated direction, not the current state.
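The performance targets above translate directly into a budget check. A minimal sketch: the thresholds come from the roadmap item, while the function name and the sample measurements are placeholders, not published scores:

```python
# Targets quoted in the roadmap above.
LCP_BUDGET_SECONDS = 1.8
CLS_BUDGET = 0.05


def within_budget(lcp_seconds: float, cls_score: float) -> dict:
    """Compare measured values against the stated targets (strictly under)."""
    return {
        "lcp": lcp_seconds < LCP_BUDGET_SECONDS,
        "cls": cls_score < CLS_BUDGET,
    }


# Placeholder measurements for illustration only.
print(within_budget(lcp_seconds=1.6, cls_score=0.08))  # {'lcp': True, 'cls': False}
```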

FAQ

What happened to FounderKit and SalesKit?

Both were shelved before v2 shipped. The workflows in FounderKit and SalesKit mapped to personas (founders, salespeople) rather than to repeatable daily tasks, which produced outputs that were generically competent but not specifically useful in real operator workflows. The five remaining kits each have a clear daily-driver use case. We may revisit founder-specific commands as part of EngineerKit or MarketingKit if the demand is specific enough to warrant focused tooling.

Why does v2 have only 101 commands when v1 had 461 skills?

Because we were measuring the wrong thing in v1. The 461-skill count measured prompt fragments loaded into context, most of which were unnecessary overhead on any given command run. v2 measures commands: discrete, completable workflows that produce an evidence artifact. 101 commands that run well beat 461 fragments that load whether you need them or not.

How is the 82,197-token figure measured?

We count at pack time using a tiktoken-compatible counter (roughly 4 characters per token). Each per-kit figure represents the total context loaded by all commands and skills in that kit if you ran them all simultaneously, and 82,197 is the sum across the five kits. It is a ceiling, not a per-command number: individual commands load a fraction of it. ck tokens <kit> reruns the count from source so you can verify it yourself.
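The "roughly 4 characters per token" rule of thumb gives a quick way to sanity-check a ledger figure without a tokenizer. A minimal sketch of that heuristic only; it is not the pack-time counter itself, and the file names and contents below are hypothetical:

```python
import math

CHARS_PER_TOKEN = 4  # the rough ratio quoted above; real tokenizers vary with content


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def kit_ceiling(files: dict) -> int:
    """Sum estimates over every file in a kit: an upper bound, not a per-command cost."""
    return sum(estimate_tokens(body) for body in files.values())


# Hypothetical file contents standing in for a kit's commands and skills.
sample_kit = {
    "commands/debug.md": "x" * 400,
    "skills/debugging.md": "y" * 1001,
}
print(kit_ceiling(sample_kit))  # 351
```

An estimate like this should land near the ledger figure; a large gap would be a reason to rerun the real count from source.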

Is the 14-day refund window real?

Yes. 14 days from purchase, no questions asked. Earlier marketing copy said 30 days; it was never 30 days, and we updated the docs when we caught the error. The policy applies to all tiers, including lifetime per-kit purchases. Each license covers 3 devices.

What is the difference between a command, a skill, and an agent in v2?

Commands are slash workflows: you invoke them explicitly and they produce an evidence artifact (report, diff, verified file). Skills are knowledge files that load automatically when a relevant command runs. Agents are read-only specialists (researcher, auditor, reviewer) that pull from real data sources and produce findings but do not gate or approve other commands. The deeper breakdown is in the guide to Claude Code skills.

Do the kits work with Claude Code's plugin marketplace or only the CLI?

Both. ck install <kit> installs globally to ~/.claude (or locally with --local). /plugin marketplace add Madni-Aghadi/claudekit-<kit> installs directly from inside Claude Code. Same manifest, same token count, same commands either way. The marketplace path is newer and slightly faster for users already inside a Claude Code session.

If you are deciding where to start, the answer usually comes down to what you do most days. Engineers spending three or more hours in Claude Code sessions should look at EngineerKit first — the daily-8 loop alone pays for itself in reduced context-switching. Marketers producing content at volume should look at MarketingKit for the voice file and humanize pipeline. Store operators should start with EcomKit and run /ecom no-sales before anything else — it will tell you where to look.