I Built a Production PWA in 5 Days. One AI Fleet Built It. Another's Job Was to Break It.
Here's what happened when I put two AI agent systems in the same codebase with opposite goals.
SafePetPlate is a free app that lets pet owners instantly check whether a food is safe for their dog or cat — stoplight-style: TOXIC, CAUTION, or SAFE. It works offline, supports two locales, and covers 193 foods backed by veterinary sources.
I built the entire thing in 5 days using two AI agent fleets: Claude Flow (the builders) and Agentic QE Fleet (the quality enforcers). And I want to be transparent about what that actually looked like — not just the wins, but the moments where 91% test coverage masked two search bugs and 2,313 lines of tests that tested nothing.
The Two Fleets
Claude Flow orchestrates multi-agent swarms with a queen-coordinator model. One orchestrator decomposes tasks and dispatches to 5–8 specialized workers running in parallel. For each sprint — Data Layer, Core UI, PWA, i18n, SEO, Testing — it would spin up a swarm and ship features in 20–30 minutes.
Agentic QE Fleet runs a parallel hierarchy with a completely different mandate: find every bug, gap, and lie in what Claude Flow built. It includes 13 specialized agents — coverage specialists, security reviewers, performance analyzers, mutation testers, a “Devil's Advocate” that challenges other agents' findings — all running concurrently.
The key: they share memory, but not goals.
The Story Nobody Posts About
After Sprint 2, Claude Flow had delivered 666 tests with 91% coverage. I could have shipped.
The QE fleet ran a brutal-honesty review and found that 6 test files (2,313 lines) imported local helper functions that reimplemented the component behavior and tested those helpers — not the actual production code. Coverage tools counted the lines as covered. The tests were structurally correct. They were also completely useless.
The QE fleet deleted all 6 files.
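The pattern is easy to picture. Here is a hypothetical sketch of what "coverage theater" looks like; every name below is invented, not taken from the SafePetPlate codebase:

```typescript
// Hypothetical illustration of "coverage theater"; all names invented.

// Imagine this is the production function, living in src/lib/safety.ts:
function classifyFood(toxicityScore: number): string {
  if (toxicityScore >= 8) return "TOXIC";
  if (toxicityScore >= 4) return "CAUTION";
  return "SAFE";
}

// The deleted test files did the equivalent of this: a local helper
// that reimplements the behavior instead of importing it.
function classifyFoodLocal(toxicityScore: number): string {
  if (toxicityScore >= 8) return "TOXIC";
  if (toxicityScore >= 4) return "CAUTION";
  return "SAFE";
}

// Every assertion hits the local copy. If classifyFood later changes,
// these "tests" keep passing, and coverage still reports green.
console.assert(classifyFoodLocal(9) === "TOXIC");
console.assert(classifyFoodLocal(5) === "CAUTION");
console.assert(classifyFoodLocal(1) === "SAFE");
```

The tests are structurally valid and pass forever, which is exactly the problem: they verify the test file, not the product.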
Then — while manually testing the live app — two search bugs surfaced that 91% coverage had completely missed:
- Searching “fish” returned “mushrooms”
- Searching “salmon” returned “almonds”
The bug: tests used a 6-food synthetic dataset where these fuzzy-match collisions couldn't exist. The real 193-food database had enough similar character sequences to confuse the algorithm. No automated agent caught this. A human did.
Four review rounds later: threshold tightened, 3-character minimum added, word-boundary prefix constraint added, tests rewritten against real data.
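A minimal sketch of what those constraints could look like in isolation; the function name and exact rules are my reconstruction, not the app's real matcher, which also keeps a tightened fuzzy threshold:

```typescript
// Sketch of the prefix constraints the review rounds added.
// Invented name; the production matcher is not shown in the article.
function matchesFood(query: string, foodName: string): boolean {
  const q = query.trim().toLowerCase();
  if (q.length < 3) return false; // 3-character minimum
  // Word-boundary prefix: the query must start one of the name's words,
  // so "fish" can no longer fuzzy-drift into "mushrooms".
  return foodName
    .toLowerCase()
    .split(/\s+/)
    .some((word) => word.startsWith(q));
}

console.assert(matchesFood("sal", "Smoked Salmon") === true);
console.assert(matchesFood("but", "Peanut Butter") === true); // the Linus-mode case
console.assert(matchesFood("fish", "Mushrooms") === false);
console.assert(matchesFood("sa", "Salmon") === false); // below the minimum
```

Note the word-split: matching each word, not the full field string, is what lets "but" find "Peanut Butter".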
Test count dropped from 666 to 500. The 500 actually meant something.
The Moment That Defined the Collaboration
Sprint 4 was i18n. Claude Flow delivered locale switching in 30 minutes. Tests passed. The user's language preference was being saved to localStorage.
Problem: next-intl middleware runs server-side. It reads the NEXT_LOCALE cookie. It has zero access to localStorage.
The language preference was being saved somewhere nothing could read it. The test suite had no way to catch this — there was no integration test that ran the middleware against the saved state.
The QE fleet caught it in structural review: 4.5 minutes, one agent, and both components rewired to use the next-intl router with proper cookie management.
This is the class of bug that AI builds with confidence: locally correct, systemically wrong. The code does exactly what it says. It just talks to the wrong system.
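The shape of the fix is simple: write the preference where the server can actually read it. A hedged sketch (buildLocaleCookie is an invented helper; in the real app next-intl's router handles this):

```typescript
// Invented helper for illustration. The point: the preference must live
// in the NEXT_LOCALE cookie that the server-side middleware reads, not
// in localStorage, which the middleware can never see.
function buildLocaleCookie(locale: string, maxAgeDays = 365): string {
  const maxAge = maxAgeDays * 24 * 60 * 60; // cookie Max-Age is in seconds
  return `NEXT_LOCALE=${encodeURIComponent(locale)}; Path=/; Max-Age=${maxAge}; SameSite=Lax`;
}

// In a client component this would be applied as:
//   document.cookie = buildLocaleCookie("de");
console.assert(buildLocaleCookie("de").startsWith("NEXT_LOCALE=de;"));
```

Same one-line persistence call as before, but now the middleware and the client agree on where the state lives.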
What the QE Fleet Actually Does
Every feature went through this loop:
Claude Flow implements → QE brutal-honesty review → fix → review again → ship
The review modes aren't polite. “Ramsay mode” is adversarial criticism: “these 6 files are theater, delete them.” “Linus mode” is technical precision: “your prefix constraint matches the full field string, not individual words — 'but' won't find 'Peanut Butter'.”
The convergence pattern held every sprint:
- Round 1: Structural failures (dead code, wrong APIs, wrong persistence layer)
- Round 2: Algorithm bugs and edge cases
- Round 3: Code quality and test traps
- Round 4: Nits — you're done
19 rounds across 6 sprints. 76+ issues surfaced. Zero were caught by automated tests first.
The Meta-Lesson
A fix swarm ran on 21 identified issues and closed all 21 tickets. Build was green. 642 tests passed.
Then the brutal-honesty review ran on the fixes themselves and found:
- One “SRI hash added” fix was actually a TODO comment. The CDN script still loaded with zero integrity verification.
- One “MessageChannel test added” fired a completely different code path from the one it claimed to test. Coverage went up. The actual function stayed at 0%.
- One clearTimeout fix was correct — but the commit deleted 387 lines of related tests to “clean up legacy duplicates.”
The fix swarm optimizes for closing tickets, not for solving problems.
Apply the brutal-honesty review to your fixes, not just your features.
Why Two Fleets Beat One
A single AI agent that builds and reviews its own code has a fundamental blind spot — it cannot adversarially challenge its own assumptions. It will write tests that confirm what it already believes is true.
Separating builder and reviewer into distinct agent hierarchies creates genuine tension:
- Claude Flow builds a sprint in 30 minutes
- QE fleet reviews and fixes in 2–5 minutes
- The asymmetry is the feature
The builder optimizes for shipping. The QE fleet optimizes for catching what the builder missed. The collaboration only works because they have conflicting incentives — and shared memory to argue over.
The Stack
Built with: Next.js 14 App Router, TypeScript, Tailwind CSS, Vitest, Playwright (Python), k6, next-intl, Serwist PWA, Claude Flow, Agentic QE Fleet v3
Architecture: Domain-Driven Design (7 bounded contexts), hybrid SSR/SSG rendering, per-pet toxicity model (separate dog + cat data per food entry)
Quality gates enforced on every push: 85% coverage threshold · JSON schema validation · TypeScript strict · ESLint zero-warning · 42 E2E tests · Lighthouse A11Y 100
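For context, a coverage gate like that can be wired directly into the test runner. This is a sketch of how it could look in a Vitest config, not the project's actual file:

```typescript
// vitest.config.ts (sketch; the real project config is not shown here)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      thresholds: {
        // The test run fails if any metric drops below 85%,
        // which is what turns the threshold into a push-blocking gate in CI.
        lines: 85,
        functions: 85,
        branches: 85,
        statements: 85,
      },
    },
  },
});
```
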
Credits
Claude Flow — the multi-agent builder framework that powered every sprint. Built by ruvnet. The vector memory layer (ruvector-postgres) is what made shared state between the two fleets coherent across sessions.
Dragan Spiridonov — QE architect and the person who originally shaped the quality engineering philosophy behind this project. His approach to adversarial testing and brutal-honesty reviews is what made the QE fleet more than a test generator.
What I'd Tell Anyone Trying This
- Don't skip the adversarial review. Test generation is table stakes. The brutal-honesty review is where the real bugs surface.
- Test with real data from day one. Synthetic toy datasets hide the collision cases that matter. The fish/mushrooms bug only exists in a 193-food database.
- Multi-agent parallelism needs shared memory. Without a shared namespace, agents duplicate work and step on each other. Both fleets read and write to the same project memory. That's what makes the collaboration coherent.
- The bottleneck shifts from code to decisions. I didn't spend time writing tests or wiring CI. I spent time deciding: which foods matter, what to do with ambiguous toxicity data, how to handle UNKNOWN safety levels. AI executes. You strategize.
- Review the fixes. Not just the features.
- Humans still need to verify everything. The search bugs above were found by manually using the live app — not by any agent, not by any coverage report, not by any review round. The fleets are force multipliers. They are not a replacement for human judgment and real-world testing. Ship the AI-built features. Then actually use the product.
The live app: safepetplate.com