The Beginning: One Skill to Rule Them All

It started with curiosity.

Before I knew anything about agentic AI, agent orchestration, or prompt engineering patterns, I discovered Claude's skill system. I thought: what if I could teach Claude everything about our QA process and have it help us write test cases?

So I built a skill. One massive skill.

It had everything in it — every navigation path in our application, every test case pattern, every business rule, acceptance criteria examples, domain terminology, automation conventions, page object structures. It was a comprehensive instruction manual for our entire QA process, packed into a single file.

And it worked. Sort of.

The skill was too large. It consumed enormous amounts of tokens. It was hard to update — changing one thing meant re-importing the entire package. Sharing it with the team was a nightmare because it was zipped and imported into Claude Desktop as a single monolithic block. Every update meant re-zipping, re-importing, and hoping nothing broke.

But the idea was right. The execution needed to evolve.


The Turning Point: “Your Skill Is Too Big”

The breakthrough came from a conversation with Dragan Spiridonov, who looked at what I'd built and said something that changed my approach entirely:

“Your skill is too big. A skill should just be a short, specific instruction for an agent — not an encyclopedia.”

My skill had all the data, all the paths, all the instructions, all the test case examples, everything bundled together. What it should have been was a focused directive — a concise set of rules that tells an agent exactly what to do, not everything it might ever need to know.

That distinction — between instructions and knowledge — became the foundation for everything that followed.


Phase 1: Breaking the Monolith

I started decomposing. The single massive skill became multiple focused files:

  • skill.md became the primary instruction source — short, specific, directive
  • knowledgebase.md held the domain knowledge
  • gaps.md documented what we didn't know yet
  • domain-knowledge.md captured the industry terminology and business rules

This was better. Each file had a clear purpose. Updates were targeted. But the architecture was still limited — everything was still zipped and imported into Claude Desktop.


Phase 2: Claude Code Changed Everything

Then Claude Code arrived, and the entire paradigm shifted.

Claude Code reads .claude/ directories natively. No zipping. No importing. No token-heavy skill packages. Just markdown files in a folder structure.

This meant:

  • Updates were instant — edit a file, Claude sees it next session
  • Sharing was version control — git pull and everyone has the latest fleet
  • Structure was unlimited — as many files as needed
  • The team could contribute — anyone could edit a knowledge file and commit it

I rebuilt everything from scratch using this new architecture.
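
The new architecture is just files on disk. A hypothetical layout, assembled for illustration from the file names mentioned in this article:

```
.claude/
├── CLAUDE.md                    # routing: which agent handles which request
├── agents/
│   ├── qe-ticket-analyzer.md
│   └── qe-test-case-designer.md
├── commands/
│   └── qe-generate-tests.md
├── skills/
│   └── test-case-design/skill.md
└── knowledge/
    ├── knowledgebase.md
    ├── domain-knowledge.md
    └── gaps.md
```

Edit any file, commit, push; the next Claude Code session picks it up with no packaging step.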


The QE Fleet v1.0: Specialized Agents

Instead of one skill that knows everything, I built specialized agents — each one an expert in exactly one thing:

Test Design Agents:

  • Ticket Analyzer — reads a Jira ticket, extracts requirements, risks, edge cases
  • Test Case Designer — designs structured test cases from an analysis
  • CSV Generator — produces Xray-compatible CSV files ready for import
  • Bug Reporter — turns rough notes into professional bug reports

Quality Agents:

  • Regression Risk Analyzer — assesses what needs retesting when code changes
  • Release Readiness — provides structured go/no-go recommendations

Orchestrator Commands:

  • /qe-generate-tests — quick path: ticket to CSV
  • /qe-full-pipeline — analysis, test design, CSV, and automation plan

7 agents, 7 skills, 4 commands.


The QE Fleet v2.0: From Plans to Working Code

v1.0 could analyze tickets and generate test cases. But it couldn't write automation code — it could only plan.

v2.0 changed that. We added 6 new agents and went from 7 to 13:

  • Page Object Writer — writes actual Python page objects following exact project conventions
  • Test Writer — writes pytest test files with correct markers, fixtures, cleanup patterns
  • Locator Scout — inspects the actual DOM on staging to extract real locators
  • SFDIPOT Analyzer — deep 7-factor analysis using James Bach's HTSM framework
  • Step Rewriter — transforms passive “Verify X” steps into active action verbs
  • Automation Planner — plans which tests to automate

The fleet could now go from a Jira ticket to working pytest code — page objects AND test files — following exact project conventions.
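
A minimal sketch of what that generated code looks like. The `LoginPage` class and its selector are hypothetical, and a local `step` decorator stands in for `@allure.step` so the snippet is self-contained:

```python
import functools
import os

def step(title):
    """Stand-in for allure.step so this sketch runs without Allure installed;
    the real generated code uses @allure.step(...) directly."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            print(f"STEP: {title}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

class BasePage:
    """Project convention: every page object inherits from BasePage."""
    def __init__(self, page):
        self.page = page
        # Convention: config comes from the environment (.env), never hardcoded.
        self.base_url = os.environ.get("STAGING_URL", "https://staging.example.com")

class LoginPage(BasePage):
    """Hypothetical page object of the kind qe-page-object-writer emits."""
    USERNAME_INPUT = "[data-testid='username']"

    @step("Open the login page")
    def open(self):
        self.page.goto(f"{self.base_url}/login")
```

Because every generated file has the same shape, reviewing AI output becomes pattern-matching rather than archaeology.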


The QE Fleet v2.1: Closing the Locator Gap

v2.0 had one remaining weakness: the Locator Scout was half-built. v2.1 closed that gap by wiring up Playwright MCP — a browser server that Claude Code can drive directly.

What changed:

  • Locator Scout now opens a real browser, navigates to staging, runs JavaScript to read exact class names
  • /qe-setup-playwright — new command for configuration
  • anti-patterns.md — a living document capturing automation mistakes
  • setup/ folder — portable install package for new team members

What Made It Work: Rules, Not Freedom

We didn't ask the AI to work instead of us; we treated it as an addition to the team.

The AI writes test cases based on the rules we set as a team, and its automation code follows the conventions we defined. We didn't give the agents freedom; we required them to follow our specific rules.

This showed up in concrete ways:

  • Terminology Standards: Never say “Happy Path” — use “Core Flow”. Never use “Verify/Check/Confirm” as action steps.
  • Priority Framework: Critical (financial, regulatory), High (core features), Medium (UI), Low (cosmetic)
  • Multi-System Verification: Any financial action must verify across all system layers
  • Automation Conventions: All page objects inherit from BasePage, @allure.step decorators, config from .env
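
Rules this mechanical can also be checked mechanically. A hypothetical sketch of a lint pass over generated test steps (the function and rule names are illustrative, not part of the fleet):

```python
import re

# Terminology rules from the team conventions described above.
FORBIDDEN_TERMS = {
    "happy path": 'use "Core Flow" instead',
}
# Steps must not start with passive verification verbs.
PASSIVE_STEP_VERBS = re.compile(r"^\s*(Verify|Check|Confirm)\b", re.IGNORECASE)

def lint_test_step(step: str) -> list[str]:
    """Return the rule violations found in a single test-case step."""
    problems = []
    for term, fix in FORBIDDEN_TERMS.items():
        if term in step.lower():
            problems.append(f'forbidden term "{term}": {fix}')
    if PASSIVE_STEP_VERBS.match(step):
        problems.append("passive step: rewrite with an active action verb")
    return problems
```

The Step Rewriter agent applies the same rules generatively; a check like this is how a human (or a CI gate) can confirm the output actually complied.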

The Human Factor: Team Adoption

One of the biggest challenges wasn't technical — it was introducing this to the team.

We weren't worried about AI replacing us. We were looking for ways to eliminate the tedious parts of QA so we could spend more time on the parts that actually require human thinking.

The team embraced it. They saw the time savings, and they appreciated that the system followed their rules, not its own.


The Biggest Problem We Didn't Expect: Garbage In, Garbage Out

If the user story is poorly written, the AI produces poor test cases. This isn't an AI limitation — it's a fundamental truth.

A human QA engineer, faced with a vague story, walks over to the product owner's desk and asks questions. AI agents can't do that.

This is why AI cannot replace QA. Not because it can't write test cases — but because quality engineering starts before the test cases exist. It starts in refinement, in asking the right questions, in challenging the story before accepting it.


What We Still See: AI Makes Mistakes

The AI still makes mistakes. Human tracking is always needed.

But here's the key insight: it's much easier for a human to review and correct AI-generated work than to write everything from scratch.

The AI is a time multiplier, not a replacement. It drafts, we refine. It follows rules, we set them.


The Evolution in Numbers

| Feature | v1 Desktop Skill | v1.0 Claude Code | v2.0 | v2.1 | Current |
|---|---|---|---|---|---|
| Architecture | Monolith | 7 agents | 13 agents | 13 agents | 19+ agents |
| Sharing | Zip export | Git | Git | Git + Setup pkg | Git + Setup pkg |
| Token Efficiency | Very Low | Medium | Medium | Medium | High (Vector) |
| Can Write Code | No | No | Yes | Yes | Yes |
| Locator Scouting | No | No | Partial | Yes (MCP) | Yes (MCP) |
| Code Indexing | No | No | No | No | Yes (Vector) |
| SFDIPOT Analysis | No | No | Yes | Yes | Yes |
| Jira Integration | No | No | No | No | Yes (MCP) |

The Architecture Today

The pipeline architecture shows how all the agents work together:

User: "Full QA for PROJ-1234"
  → CLAUDE.md routes
  → /qe-full-qa orchestrator
    → qe-ticket-analyzer (fetches from Jira)
    → qe-sfdipot-analyzer (7-factor analysis)
    → qe-test-case-designer + qe-csv-generator
    → qe-step-rewriter
    → qe-page-object-writer + qe-test-writer
    → brutal-honesty-review (quality gate)
  → Output: Analysis + CSV + Page Objects + Tests + Quality Score
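
The flow above can be sketched as a simple sequential pipeline in which each agent reads the accumulated context of the agents before it. This is purely illustrative; the real orchestration is done by Claude Code and markdown command files, not Python:

```python
# Agent order from the /qe-full-qa diagram above.
PIPELINE = [
    "qe-ticket-analyzer",
    "qe-sfdipot-analyzer",
    "qe-test-case-designer",
    "qe-csv-generator",
    "qe-step-rewriter",
    "qe-page-object-writer",
    "qe-test-writer",
    "brutal-honesty-review",
]

def run_pipeline(ticket_id, agents):
    """Run each agent in order, threading the accumulated context forward."""
    context = {"ticket": ticket_id}
    for name in PIPELINE:
        # Each agent sees everything produced so far and adds its own output.
        context[name] = agents[name](context)
    return context
```

The important property is the last entry: brutal-honesty-review runs after everything else, so no artifact reaches the user without passing the quality gate.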

Phase 3: Team Adoption and Real-World Pressure

The team adopted it. But adoption brought real-world pressure — agents needed tuning. Not big architectural changes, but small things: too verbose output, formatting issues, aggressive step rewriting. The iterative polishing that turns a prototype into a reliable tool.


Phase 4: The Token Problem — Hitting the Budget Wall

We had a limited Claude budget, and the fleet had grown to 13 agents and 11+ skill files totaling ~9,000 lines. Every pipeline run loaded entire skill files, so a single run consumed 15,000–45,000 tokens of knowledge alone.


The Solution: A Local Vector Store

The answer was local vector search, running entirely on our laptops with zero API costs. Instead of reading entire files, we chop the knowledge into small chunks, convert them to vectors, and store them in ChromaDB. When an agent needs context, it retrieves only the relevant chunks.

Results:

  • module-details.md: 97.9% savings
  • test-automation skill: 95.4% savings
  • Per pipeline: ~20,000 tokens drops to ~4,000
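
The retrieval idea fits in a few lines. This toy version uses bag-of-words cosine similarity instead of real embeddings and a plain list instead of ChromaDB, but the shape is the same: chunk, vectorize, retrieve top-k:

```python
import math
import re
from collections import Counter

def chunks(text, size=40):
    """Split a knowledge file into small fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def vectorize(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_chunks(query, knowledge, k=3):
    """Return only the k most relevant chunks instead of the whole file."""
    q = vectorize(query)
    return sorted(knowledge, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]
```

An agent that used to read a 2,000-line skill file now receives three or four paragraphs, which is where the ~80% per-pipeline saving comes from.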

Phase 5: The MCP Experiment — And Why We Came Back to the CLI

We tried exposing the vector search through an MCP server. What we got were Windows compatibility nightmares, silent failures, ~55,000 tokens of schema overhead, and a complex setup. So we came back to the CLI wrapper that had been sitting in the codebase all along.

The best tool integration is the one that adds nothing.
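
The wrapper pattern we landed on is deliberately boring: a small script the agent shells out to. Everything below (the `qe-search` name, the `search()` body) is a hypothetical stand-in for the real ChromaDB lookup:

```python
import argparse
import json

def search(query: str, k: int) -> list[str]:
    """Stand-in for the real lookup; the actual wrapper queries the local ChromaDB store."""
    return [f"[chunk {i}] relevant to: {query}" for i in range(k)]

def main(argv=None):
    # A plain CLI: no server process, no schema handshake, no per-session
    # token overhead. Just arguments in and JSON on stdout.
    parser = argparse.ArgumentParser(prog="qe-search")  # hypothetical tool name
    parser.add_argument("query", help="what the agent needs context about")
    parser.add_argument("-k", type=int, default=3, help="number of chunks to return")
    args = parser.parse_args(argv)
    print(json.dumps(search(args.query, args.k)))

if __name__ == "__main__":
    main()
```

Compared with the MCP route, the only thing the agent needs to know is one command line; that is what "adds nothing" means in practice.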


Phase 6: Taming the DOM — Scout Locator Optimization

DOM inspection was the next source of token waste, so we built a two-tier system:

  • Option A: Targeted queries (~90% token reduction)
  • Option B: Optimized DOM snapshot fallback (~70–80% reduction)
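
Option A can be illustrated with a toy version: instead of handing the agent the entire DOM, return only locator candidates. This sketch parses static HTML with the standard library; the real Locator Scout runs JavaScript in a live browser via Playwright MCP:

```python
from html.parser import HTMLParser

class LocatorScout(HTMLParser):
    """Collect only locator candidates instead of returning the full DOM."""

    def __init__(self):
        super().__init__()
        self.locators = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-testid" in attrs:  # most stable selector, preferred first
            self.locators.append(f"[data-testid='{attrs['data-testid']}']")
        elif "id" in attrs:
            self.locators.append(f"#{attrs['id']}")

def scout(html: str) -> list[str]:
    s = LocatorScout()
    s.feed(html)
    return s.locators
```

A page that would cost tens of thousands of tokens as a raw snapshot comes back as a short list of selectors, which is where the ~90% reduction comes from.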

Agent Hygiene: A Refactoring Pass

Six maintenance fixes: hardcoded credentials removed, environment config corrected, missing skill loading added, nonexistent methods fixed, quality gate aligned with CI, review scoping fixed.

The maintenance tax of an agent fleet: agents only know what their instructions say.


Lessons Learned

  1. Start with one thing that works. Don't try to build the entire fleet at once. Build one agent, make it reliable, then add the next.
  2. Separate instructions from knowledge. A skill should tell an agent what to do. Knowledge files should tell it what it needs to know. Mixing them creates a monolith.
  3. Rules come from the team. The agents follow conventions the team defined. Not conventions the AI invented.
  4. Design for human review. Every output is a draft. The human reviews, corrects, and approves.
  5. Sharing matters more than sophistication. A simple system the whole team uses beats a brilliant system only one person understands.
  6. Adopt incrementally. Introduce one agent at a time. Let the team see the value before adding complexity.
  7. AI makes mistakes — plan for it. Build review steps into the pipeline. The brutal-honesty reviewer exists for a reason.
  8. AI is only as good as the input. Garbage user stories produce garbage test cases. Quality starts before the AI touches anything.
  9. AI can't ask questions — that's the real gap. A human QA engineer challenges vague requirements. AI accepts them and produces confident-sounding garbage.
  10. Optimization is continuous. Token costs, output formatting, agent behavior — all need ongoing tuning.
  11. Token budgets force better architecture. Running out of budget is what pushed us to vector search, which made everything better.
  12. Simpler beats cleverer. The CLI wrapper beat the MCP server. The targeted DOM query beat the full snapshot. Every time we chose simpler, we won.

The Bottom Line

We didn't set out to build an AI system. We set out to solve a problem: QA is time-consuming, repetitive work that follows patterns. AI handles the drafting. Humans handle the judgment.

From one oversized skill to a growing fleet of specialized agents — each following rules we defined, producing work we review, and saving time we reinvest into deeper quality work.

Updated March 2026 — This is a living document.