Back to blog
May 18, 2026

Harness Engineering: Building Systems Around AI Coding Agents

Harness EngineeringAI AgentsCoding AgentsArchitectureContext EngineeringMCP

Five independent teams converged on the same finding: coding agents become reliable only when you build the right scaffolding around them. Here's what a production-grade agent harness actually looks like — context architecture, tool constraints, verification loops, and the 7 components that separate agents that ship from agents that hallucinate.

Harness Engineering: Building Systems Around AI Coding Agents

The Problem

Every team building AI coding agents hits the same wall. The model is smart — GPT-5, Claude Opus 4, Gemini 2.5 — but the agent still makes dumb mistakes. It writes code that doesn't compile. It edits the wrong file. It forgets the project conventions halfway through a session. It "hallucinates" APIs that don't exist.

The natural instinct is to blame the model. Swap in a smarter one. But then the same failures reappear, just dressed differently.

In early 2025, LangChain ran a controlled experiment on Terminal Bench 2.0 that changed how I think about this problem. They improved their coding agent's accuracy from 52.8% to 66.5% — a 26% improvement — without changing the underlying model at all. The changes were all structural: better context files, structured output constraints, self-verification loops, tool optimization.

The ceiling wasn't the model. It was everything around the model.

This is the central insight of harness engineering — the discipline of building systems, tools, constraints, and feedback loops around AI coding agents to make them reliable and productive.

What Is Harness Engineering?

The term was coined by Mitchell Hashimoto (co-founder of HashiCorp) and gained mainstream attention when OpenAI published their Codex architecture deep-dive in early 2026. The core formula is deceptively simple:

> Agent = Model + Harness

The model provides intelligence. The harness makes that intelligence *useful*. A better harness often matters more than a better model.

┌─────────────────────────────────────────────────┐
│                 THE AGENT                        │
│  ┌───────────────────────────────────────────┐   │
│  │              THE MODEL                     │   │
│  │  (GPT-5 / Claude / Gemini / Local LLM)    │   │
│  └───────────────────────────────────────────┘   │
│              ↕  context + tools + constraints    │
│  ┌───────────────────────────────────────────┐   │
│  │            THE HARNESS                      │   │
│  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────────┐  │   │
│  │  │Ctxt  │ │Tools │ │Sub-  │ │Verif.    │  │   │
│  │  │Eng.  │ │/MCP  │ │Agents│ │Loops     │  │   │
│  │  └──────┘ └──────┘ └──────┘ └──────────┘  │   │
│  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────────┐  │   │
│  │  │Arch. │ │Con-  │ │Hooks │ │Memory/   │  │   │
│  │  │Const.│ │text  │ │      │ │Persist.  │  │   │
│  │  └──────┘ └──────┘ └──────┘ └──────────┘  │   │
│  └───────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘
Agent = Model + Harness architecture diagram

Hashimoto's guiding principle:

> "Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."

This isn't prompt engineering. It's systems engineering for AI.

Why It Matters Now

In 2025, every team adopted AI coding agents. By 2026, the winning teams are the ones who engineered their agent environments — not just picked the best model. Five independent teams converged on the same finding:

Team
Result
Key Insight
OpenAI (Codex)
1M+ LOC codebase with 3 engineers
Progress came from building tools and scaffolding, not model capability
Anthropic (C compiler)
16 parallel Claude agents, 100K LOC
"Most effort went into the environment — tests, feedback, orientation"
LangChain
52.8% → 66.5% accuracy
Harness-only improvements, no model change
Stripe
1,000+ agent-generated PRs/week
Verification loops catch 94% of agent errors
Vercel (v0)
1M+ deployments
Context engineering was the bottleneck, not code generation

The 7 Components of a Production Harness

Through research and synthesis of the best practices across these teams, a clear pattern emerges. Every production-grade harness comprises seven components:

1. Context Engineering

This is the foundation. Give the agent a map of your codebase, conventions, and constraints — but do it surgically.

A study by HumanLayer (April 2026) found that instruction files under 60 lines significantly outperform longer ones. LLM-generated instruction files (200+ lines) produce worse results than concise human-written ones:

# Good instruction file (under 60 lines, human-written)
## Role
You are a code review agent for Python backend services.
## Environment
- Python 3.11, FastAPI, SQLAlchemy
- Tests: pytest, run with `pytest tests/`
- Linting: ruff, run with `ruff check .`
## Conventions
- All functions need type annotations
- Error handling: raise specific exceptions, never bare `except`
- Database operations: always use context managers
## What to do when stuck
1. Read the relevant test file first
2. Check existing patterns in similar files
3. If still blocked, add a TODO comment and move on
## What NOT to do
- Never modify migration files
- Never change pyproject.toml without explicit instruction
- Never assume a module exists without importing it first

The Claude Code source (leaked April 2026) confirmed this architecture first-hand: CLAUDE.md and AGENT.md files are assembled as "user content" alongside the dynamically built system prompt — a first-class part of Anthropic's own harness.

Three tiers of context is the emerging pattern:

┌─────────────────────────────────────────────┐
│        TIER 1: AGENTS.md / CLAUDE.md         │
│  Project-wide conventions, architecture map  │
│  Written by humans, kept < 60 lines          │
├─────────────────────────────────────────────┤
│        TIER 2: TASK-SPECIFIC CONTEXT         │
│  Relevant files, recent git diff, TODO lists │
│  Dynamically assembled per session           │
├─────────────────────────────────────────────┤
│     TIER 3: PROGRESSIVE DISCLOSURE           │
│  Read more files on demand as agent needs    │
│  them. Don't dump everything upfront.        │
└─────────────────────────────────────────────┘
Three-tier context architecture diagram

2. Architectural Constraints

Agents need guardrails that are *enforced*, not suggested. Prompts are probabilistic; linters, CI checks, and type systems are deterministic.

Constraint Type
Probabilistic (Prompt)
Deterministic (Enforced)
Coding style
"Follow PEP 8"
ruff check --fix in CI
Security
"Don't expose secrets"
Secret scanner blocks PRs
Architecture
"Use our patterns"
Architecture tests fail on violations
Dependencies
"Don't add deps"
deny.toml blocks unauthorized packages

Spotify's Honk system — which has merged 1,500+ AI-generated PRs across hundreds of repositories — treats deterministic enforcement as the primary reliability layer. The agent proposes code; CI enforces the rules. If the agent breaks a convention, the PR doesn't merge.

3. Tools / MCP Layer

The Model Context Protocol (MCP) standardizes how agents discover and interact with tools. Every tool is a capability with a schema, inputs, and outputs — the agent selects tools by understanding their interface.

┌──────────────┐     MCP Protocol      ┌──────────────┐
│   AGENT      │ ◄──────────────────►  │  MCP Server  │
│              │   list_tools()        │              │
│  ┌────────┐  │   call_tool()         │  ┌────────┐  │
│  │ Model  │  │   tool_result()       │  │ Shell  │  │
│  └────────┘  │                       │  ├────────┤  │
│              │                       │  │ Files  │  │
│              │                       │  ├────────┤  │
│              │                       │  │ Git    │  │
│              │                       │  ├────────┤  │
│              │                       │  │ Editor │  │
│              │                       │  └────────┘  │
└──────────────┘                       └──────────────┘

Key tool design principles from production harnesses:

  • Scoped tools: Give the agent read-only filesystem access unless explicitly requesting write
  • Explicit schemas: Every tool has typed inputs and documented side effects
  • Idempotency: Running the same tool twice should produce the same result (or be safe)
  • Error recovery: Tools return structured errors, not crashes
  • 4. Sub-Agent Pattern

    A single agent loop breaks down at scale. The solution is a multi-agent architecture with separated concerns — what Anthropic calls the Planner → Generator → Evaluator pattern:

    ┌──────────┐     spec     ┌───────────┐    code     ┌───────────┐
    │ PLANNER  │ ──────────►  │ GENERATOR │ ──────────►  │ EVALUATOR │
    │          │              │ (Builder) │              │  (QA)     │
    │ Expands  │              │ Implements│              │ Tests     │
    │ prompt   │              │ features  │              │ & grades  │
    │ to spec  │              │ one at a  │  feedback    │           │
    │          │              │ time      │ ◄─────────── │           │
    └──────────┘              └───────────┘              └───────────┘
    Planner → Generator → Evaluator multi-agent pattern

    The evaluator is critical because agents consistently rate their own work too generously. Separating generation from evaluation creates an honest feedback loop. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.

    5. Hooks and Lifecycle Events

    Production harnesses fire hooks at key lifecycle points. These allow custom logic to run before, during, and after agent execution:

    Session Start
      │
      ├── on_init       → Load context, validate environment, check token budget
      │
      ├── on_tool_call  → Validate inputs, rate limit, log for audit
      │
      ├── on_response   → Check output quality, enforce constraints
      │
      ├── on_error      → Retry with backoff, escalate if persistent
      │
      └── on_complete   → Sign artifact, update project state, notify team

    OpenAI's Codex team found that hooks were the single highest-ROI harness component — they caught 40% of production incidents before they reached users.

    6. Self-Verification Loops

    The most impactful pattern to emerge from production harnesses: make the agent verify its own work.

    Agent writes code
           │
           ▼
    Run tests / lint
           │
           ▼
    Pass? ──Yes──► Done
           │
           No
           │
           ▼
    Feed error output back to agent
           │
           ▼
    Agent fixes code
           │
           ▼
    Run tests / lint (loop)
    Self-verification loop flow chart

    Stripe's 1,000+ PRs/week pipeline uses exactly this loop. The agent generates code, CI runs tests and linting, and if anything fails, the full error output is fed back to the agent for a fix attempt. This loop repeats until either all checks pass or a max-retry limit is hit.

    7. Memory and Persistence

    Agents with fresh context every session forget everything. Production harnesses implement persistent memory — key decisions, learned patterns, and user preferences that survive session boundaries:

  • Filesystem-based: Key decisions, user preferences, session summaries written to structured files
  • Vector stores: Semantic search over past decisions and solutions
  • Knowledge graphs: Entity-relationship maps of the codebase that persist across sessions
  • The Twist: Better Models Need Better Harnesses

    Here's the counterintuitive finding. When I started this research, I assumed smarter models would reduce the need for harness engineering. The opposite is true.

    More capable models unlock more autonomy. More autonomy demands better guardrails. Nicholas Carlini (Anthropic's C compiler project) put it directly: each time Claude got smarter, they had to redesign the harness to handle the new class of failures it could generate.

    Model Capability
         │
         │  Without harness: failures scale with capability
         │  ┌────────────────────────────────────────┐
         │  │  🚫 Model gets smarter                 │
         │  │  🚫 Can do more harm when wrong        │
         │  │  🚫 Harder to predict failure modes    │
         │  └────────────────────────────────────────┘
         │
         │  With harness: capability is amplified
         │  ┌────────────────────────────────────────┐
         │  │  ✅ Model gets smarter                 │
         │  │  ✅ Harness catches new failure modes  │
         │  │  ✅ Each capability level is safe       │
         │  └────────────────────────────────────────┘
         │
         └──────────────────────────► Time

    Lessons from the Trenches

    Start monolithic, add agents when you need them

    Multi-agent systems introduce the same complexity as microservices, compounded by non-determinism. Mitchell Hashimoto's advice: start with a single-agent loop. Reach for multi-agent when you hit a specific ceiling — like the evaluator needing its own context budget, or the planner needing to think in parallel with the builder.

    Keep instruction files under 60 lines

    Every line in your instruction file costs tokens on every single API call. If it doesn't prevent a real mistake the agent has made before, cut it. LLM-generated instruction files are verbose and fluffy; write your own.

    Fail closed on auth

    If your agent's auth middleware can't verify identity, reject the request. A 401 during a configuration hiccup is far better than silent data exposure. This applies to agents just as it applies to APIs.

    The harness is a living system

    Your harness should evolve as you discover new failure modes. Every time an agent makes a mistake you didn't catch, that's a signal to improve the harness — add a constraint, tighten a tool, write a new hook.

    Results

    #
    Component
    Impact
    1
    Context Engineering (3 tiers, <60 lines)
    Prevents "lost in project" failures
    2
    Architectural Constraints (deterministic CI)
    Stops convention violations at merge
    3
    Tool/MCP Layer (scoped, explicit)
    Eliminates unsafe filesystem operations
    4
    Sub-Agent Pattern (Planner→Generator→Evaluator)
    26% accuracy improvement (LangChain)
    5
    Hooks (on_init, on_tool_call, on_error)
    Catches 40% of incidents (OpenAI)
    6
    Self-Verification Loops
    94% error catch rate (Stripe)
    7
    Memory & Persistence
    Cross-session learning, no context loss

    What's Next

    The harness engineering field is moving fast. The trends I'm watching:

  • Standardized harness frameworks — OpenCode's sub-agent model, Atomic's tiered context, and Anthropic's leaked patterns are converging
  • MCP becoming the universal tool protocol — Every major agent platform now supports it
  • Harness-as-a-Service — Expect managed harness infrastructure (like Vercel for agent deployment)
  • Evaluator models — Specialized models trained specifically to catch agent mistakes
  • The bottom line: if your coding agent is unreliable, don't blame the model. Look at the harness. That's where the leverage is.


    Check more open source repos at: github.com/mnkrana/