Agent Harness Engineering: What 8 Months in Production Taught Me
Same model. 36 points higher on benchmarks. The fix was never the model.
Anthropic gave Opus 4.5 a high-level prompt to build a production web app. It failed. Not because the model was bad. Because it tried to one-shot everything (admit it, you do the same thing), left half-implemented features across context windows, and declared victory too early. They fixed the scaffolding, added progress tracking and incremental workflows: the same model started shipping. They called the article "Effective Harnesses for Long-Running Agents."
TL;DR: "Harness > model" is correct but incomplete. The mechanism that makes it work is progressive disclosure: show the model only what it needs, when it needs it. Same model jumped 36 points on CORE-Bench just by switching to a better scaffold. My framework after 8 months and 5 production apps: contracts over vibes, constraints over tools, simplify every quarter. Copy-paste templates included.
Then the word showed up everywhere. OpenAI published "Harness Engineering." LangChain bumped their coding agent from 52.8% to 66.5% by changing nothing but the harness. Mitchell Hashimoto and Martin Fowler wrote about it. SWE-bench Pro confirmed it at scale: same model, different scaffolding, different results.
I looked at my CLAUDE.md, my prompt contracts, my CLI wrappers, and realized this is exactly what I'd been doing for eight months across five production apps. I just never had the word to name it. Harness. That's the word.
So yes, the harness matters more than the model. That part is settled.
But knowing "the harness matters" is like knowing "eat healthy and exercise." True, useless if you don't actually do it, and about to generate an entire industry of overcomplicated frameworks that miss the point.
I've spent eight months making every mistake in the book. This is what survived.
The three mistakes everyone will make
I know because I made all three.
Mistake 1: Stacking tools instead of writing contracts
When I started building OpenClaw, my multi-model AI agent, I wired up 12 MCP tools. Search, memory, credit checks, RSS monitoring, Discord alerts, cron status, user queries, backup verification. Felt thorough. Felt professional.
The agent spent more time deciding which tool to call than solving the actual problem.
On a simple "anything need my attention this morning?" query, it would fire 4-5 tool calls in sequence, sometimes hitting the same endpoint twice with slightly different parameters because the descriptions were vague enough to overlap. One morning it called check_users, then check_credits, then check_users again with a different filter, then gave me a response that contradicted itself between paragraphs.
I ripped out 8 tools. Replaced 12 with 4 that had precise descriptions written as contracts. Not "query credit data" but "find users whose current credit balance deviates from expected by more than 10%, flag the anomaly with drift amount, and sort by severity." Same underlying code. Same model. The only thing that changed was the description.
40% fewer tool calls. Outputs stopped contradicting themselves. The tool description was the problem the whole time.
I built the full prompt contracts framework around this principle and it became the single most impactful change in my entire workflow. You don't share code with the agent anymore. You share intent, constraints, and expected behavior. The description IS the contract.
Mistake 2: Choosing complexity when simplicity works
47,000 tokens. That's what Phil Schmid measured as the cost of integrating 6 MCP servers the standard way. Just schema definitions. Before your agent even starts thinking about your actual problem, it's chewing through forty-seven thousand tokens of JSON tool descriptions.
Manus solved this by exposing MCP tools through a CLI wrapper. Same capabilities. About 400 tokens.
I didn't know those numbers when I built my first MCP server in late 2025. Everybody was building them. The protocol was new, it was shiny, it felt like the correct abstraction. So I built one too. Custom OAuth flow, token refresh handling, multi-source data aggregation, the works.
Sixteen commits. Four hours debugging an auth token that expired mid-session. I was in a hotel room in Cancún with a pool literally ten meters away, watching logs scroll instead of swimming. Eventually shipped it and it works fine now. But then I also built CLIs that did the same thing for other services. The CLI needed a bash script and a JSON output. Worked on the first try.
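The CLI-wrapper pattern is small enough to sketch in a few lines. This is an illustrative Python version (mine was bash); the feed service and field names are made up. The point is the shape: the agent calls this through bash, so none of the tool schema ever sits in the context window.

```python
import json

def fetch_feed_summary(feed_name: str) -> dict:
    # Stand-in for the real API call the MCP server used to wrap.
    # Replace with your actual client code.
    return {"feed": feed_name, "unread": 3, "latest": "Example headline"}

def run_cli(argv: list[str]) -> str:
    """The entire 'protocol': one positional arg in, one JSON line out."""
    if len(argv) != 1:
        return json.dumps({"error": "usage: feedcli <feed_name>"})
    return json.dumps(fetch_feed_summary(argv[0]))

print(run_cli(["tech-news"]))
```

Compare that to an MCP server for the same capability: OAuth, schema definitions, session management. The CLI version is one function the agent already knows how to call, because it already knows bash.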
Cool.
Vercel ran the same experiment at scale. Started with comprehensive tool libraries, search, code, file, API tools. Every capability you'd want. Agents got confused, made redundant calls, took unnecessary steps. They stripped to essentials, gave the agent direct bash access. Success rate went to 100%, speed increased 3.5x.
I wrote about why CLIs beat MCP for most agent setups and the reaction was wild. Turns out a lot of builders suspected the same thing but felt weird saying it while the entire ecosystem was pushing MCP as the future.
MCP has its place. But the instinct to reach for the complex solution first is exactly how harnesses get bloated before they get useful.
Mistake 3: Never removing anything
This one is sneaky because it feels irresponsible. You built something that works. It's in production. Removing it feels like removing a guardrail on a highway.
But models improve. And your harness doesn't know that.
Last month I removed an entire memory subsystem from OpenClaw. External context retrieval, embedding lookups, conversation history injection. It had taken two weeks to build and four months to maintain. I deleted it on a Thursday. By Friday the numbers told the story:
Response latency dropped 2.3 seconds per query. The agent stopped hallucinating "remembered" context that was actually stale data from three months ago. User satisfaction on support interactions went up because the agent was responding to what people actually said instead of what the memory system thought was relevant.
The model (Kimi K2.5) had gotten good enough at maintaining context within sessions that the external memory layer was actively making things worse. I was paying for infrastructure that degraded my product.
Manus, probably the most battle-tested autonomous agent in production right now, learned this the hard way. They rewrote their entire harness five times in six months. Not because models changed. Because each rewrite stripped complexity.
Their initial version used a todo.md file that the agent rewrote at every step to track progress. Roughly 30% of all tokens went to updating that file. They replaced it with a sub-agent planner that returns a structured object and injects it only when needed.
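The structured-plan idea is easy to sketch. This is my illustration of the concept, not Manus's actual implementation: a planner returns one compact object, and only the current step gets injected into context, instead of the agent rewriting a whole todo.md every turn.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Sketch of a planner's structured output (illustrative, not Manus's code)."""
    goal: str
    steps: list[str]
    current: int = 0
    notes: dict = field(default_factory=dict)

    def context_snippet(self) -> str:
        # The only part that ever reaches the model's context window.
        return f"Step {self.current + 1}/{len(self.steps)}: {self.steps[self.current]}"

plan = Plan(
    goal="Ship the billing dashboard",
    steps=["Write plan.md", "Implement drift query", "Add tests", "Commit"],
)
print(plan.context_snippet())  # Step 1/4: Write plan.md
```

The agent never sees the full step list unless it asks. That is the difference between a 30%-of-tokens todo.md and a one-line injection.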
They cut their tools from dozens of dynamic MCP schemas to fewer than 20 atomic functions: bash, filesystem, code execution. MCP tools aren't even in the context window anymore. They're exposed via CLI that the agent calls through bash.
Peak Ji, their Chief Scientist, put it bluntly: "As models get stronger, we shouldn't be building more scaffolding, we should be getting out of the model's way."
Anthropic says the same thing: "As model capabilities increase, the tools that your models once needed might now be constraining them."
If your harness hasn't shrunk in three months, it's probably already too big.
The one pattern that makes all of this work
All three mistakes share the same root cause: giving the model too much, too early, for too long. Twelve tools when it needed four. MCP overhead when a bash script would do. A memory system injecting stale context the model had outgrown.
The fix has a name. Progressive disclosure. Show the model only what it needs, when it needs it. Hide everything else.
Cursor does this aggressively. Their dynamic context discovery system filters roughly 47% of available tokens from the model at any given step. Not by accident, by architecture. The model sees only what's relevant to this specific task, this specific moment.
Claude Code does it with skills. You create a skills/ directory, Claude sees only skill names and short descriptions at session start. It loads the full content only when it decides it needs it. Lazy loading for LLMs.
Manus does it with their layered action space. Level 1: 20 atomic tools, always visible. Level 2: sandbox utilities called through bash, never polluting context. Level 3: the agent writes its own scripts for complex chains instead of making three separate LLM roundtrips.
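The lazy-loading mechanic behind all three systems fits in a few lines. A hedged sketch with made-up tool names: at session start the model sees only names and one-line summaries; the full definition is injected only once the model selects that tool.

```python
# Illustrative tool registry. Names, schemas, and instructions are
# hypothetical, not from any real product.
FULL_DEFINITIONS = {
    "check_credit_drift": {
        "summary": "Find users whose credit balance drifted >10%.",
        "schema": {"type": "object",
                   "properties": {"min_drift_pct": {"type": "number"}}},
        "instructions": "Sort results by drift severity.",
    },
    "verify_backups": {
        "summary": "Confirm last night's backups completed.",
        "schema": {"type": "object",
                   "properties": {"service": {"type": "string"}}},
        "instructions": "Report any backup older than 24h as failed.",
    },
}

def session_index() -> str:
    """What the model sees at session start: names and summaries only."""
    return "\n".join(f"{name}: {d['summary']}"
                     for name, d in FULL_DEFINITIONS.items())

def disclose(name: str) -> dict:
    """Injected only when the model decides it needs this tool."""
    return FULL_DEFINITIONS[name]

print(session_index())
```

Two summaries at session start instead of two full schemas. Scale that to dozens of tools and the 47,000-token problem from earlier disappears.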
The benchmark impact is real. Same model, Claude Opus 4.5, scored 42% on CORE-Bench with a generic scaffold. With Claude Code as the harness, 78%. That's not just progressive disclosure, Claude Code brings better tool management, environment setup, and compaction. But the researchers who ran the test were blunt: the scaffold nearly doubled the score. The model didn't change.
The three pillars below are how I apply progressive disclosure in practice. Tools, configuration, maintenance.
The framework: contracts, constraints, cleanup
Not a 47-layer architecture diagram. Not a GitHub repo with 41 skill definitions and 11 sub-agents. Three things that actually survived contact with production.
Pillar 1: Contracts over vibes
You saw what happened with OpenClaw's tool descriptions. The reason it works is mechanical: the model does token-level pattern matching on your description to decide whether a tool is relevant to the current query. Vague descriptions match everything. Precise descriptions match only what you want.
The template I use for all my tool definitions now, and I'd suggest you do the same for your three most-used tools tonight:
name: [tool_name]
description: [WHAT specifically it returns, not vague nouns
but the actual shape of useful output].
Call this when [specific trigger conditions].
Do NOT call when [common misuse case].
Expected output: [format and key fields].
The "Do NOT call when" line is the one that changes everything. Without it, the model treats every tool as a maybe. With it, the model has a contract.
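Here is a hypothetical instantiation of that template next to the vague version it replaces. The tool names and fields are illustrative, and the checklist function is just a quick way to audit your own descriptions against the contract elements.

```python
# Contract-style description: every element of the template is present.
credit_drift_tool = {
    "name": "check_credit_drift",
    "description": (
        "Returns users whose current credit balance deviates from expected "
        "by more than 10%, with the drift amount per user, sorted by "
        "severity. Call this when the query asks about billing anomalies "
        "or credit health. Do NOT call for general user lookups or account "
        "status checks. Expected output: JSON list of "
        "{user_id, expected, actual, drift_pct}."
    ),
}

# Vague description: matches almost any query, so the model over-calls it.
vague_tool = {
    "name": "check_credits",
    "description": "Query credit data.",
}

def contract_score(tool: dict) -> int:
    """How many contract elements does this description actually carry?"""
    checks = ["Call this when", "Do NOT call", "Expected output"]
    return sum(1 for c in checks if c in tool["description"])

print(contract_score(credit_drift_tool))  # 3
print(contract_score(vague_tool))         # 0
```

Run something like `contract_score` over your own tool registry and you'll find the zeros fast. Those are the tools your agent is guessing about.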
Pillar 2: Constraints over tools
Every time you think "I need a new tool for this," stop. Ask first: can a single line in CLAUDE.md solve this instead?
Instead of a linter MCP server, a constraint: "Run tests before every commit." Instead of a style-checking agent, a constraint: "Follow the conventions in CONVENTIONS.md." Instead of a planning tool, a constraint: "Always write plan.md before touching code."
A constraint in CLAUDE.md costs zero tokens at runtime and adds zero failure surface. A tool costs tokens every time it's called, adds a decision point the model can get wrong, and needs maintenance. The math is obvious once you see it.
A starter CLAUDE.md I actually use as base across projects. Not a monolith, a navigation layer that points to specialized files:
# CLAUDE.md
## Role
Senior engineer. You plan before you code. You test before you push.
## Workflow
1. Read this file + any progress.md at session start
2. Plan first. Write plan.md before implementation.
3. One feature at a time. Commit after each.
4. Run existing tests before AND after changes
5. Update progress.md before session ends
## Constraints
- Never overwrite files without showing a diff first
- If a task needs more than 3 files changed, break it down
- When unsure, ask. Don't guess at business logic.
- Keep commits small and descriptive
## Project specifics
[Your stack, conventions, key files here]
See CONVENTIONS.md for code style rules.
Twenty lines that tell the agent how to work, not what to know. Anthropic's own harness for long-running agents uses a progress file, a feature list, and structured git commits on top of a similar core. OpenAI's Codex team learned the hard way that a giant AGENTS.md fails. Their advice: "give Codex a map, not a 1,000-page instruction manual."
The key is progressive disclosure applied to config: a short CLAUDE.md that points to detailed files the agent reads when it needs them. Not a 500-line monolith it skims and ignores. Not a 10-line stub that says nothing. A navigation layer.
Pillar 3: Quarterly cleanup
Every three months, I sit down with my harness and ask five questions:
- Which tools has the agent not called in 30 days? Delete them.
- Which CLAUDE.md rules exist because the old model was dumb? Remove them.
- Which guardrails are now handled natively by the model? Strip them.
- Is the context injection still necessary or has the model's retrieval improved enough? Test without it.
- Can two tools merge into one with a better description? Do it.
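The first question is the easiest to automate. A sketch, assuming you log tool calls as (name, timestamp) pairs; the log format here is invented, so adapt it to whatever your harness actually records.

```python
from datetime import datetime, timedelta

def unused_tools(registered: set[str],
                 call_log: list[tuple[str, datetime]],
                 now: datetime, days: int = 30) -> set[str]:
    """Tools with zero calls in the last `days` days: deletion candidates."""
    cutoff = now - timedelta(days=days)
    recently_used = {name for name, ts in call_log if ts >= cutoff}
    return registered - recently_used

# Hypothetical registry and log for illustration.
now = datetime(2026, 1, 1)
log = [
    ("check_credit_drift", datetime(2025, 12, 28)),  # recent
    ("verify_backups", datetime(2025, 10, 2)),       # stale
]
registry = {"check_credit_drift", "verify_backups", "rss_monitor"}
print(sorted(unused_tools(registry, log, now)))  # ['rss_monitor', 'verify_backups']
```

Anything in that output goes on the chopping block for the quarter. If deleting it breaks something, your logging was lying to you, which is also worth knowing.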
Last quarter on OpenClaw: I deleted a retry-with-different-model fallback that hadn't triggered in 6 weeks (Kimi K2.5 had become stable enough). I removed three CLAUDE.md rules about JSON formatting that the model now handles natively. I merged two monitoring tools into one with a more specific contract.
Net result: 30% fewer moving parts. Zero functionality lost. Faster responses. Less to maintain.
Manus runs this process continuously. Peak Ji's test: run your agent eval suite against a stronger model. If performance doesn't improve, your harness is hobbling it. That question alone tells you everything about whether you're building scaffolding or building a cage.
What this means for you tonight
Fifteen minutes. That's all you need to actually start.
Rewrite your three most-used tool descriptions. Find the tools your agent calls most. Open their descriptions. If they say what the tool does instead of when to call it and what to expect, rewrite them with the contract template above. Five minutes per tool. The impact is immediate, the agent stops guessing and starts following instructions.
Then structure your CLAUDE.md as a navigation layer. If you don't have one yet, paste the starter template above and fill in your stack details. If you already have one, check: is it a monolith or a map? Move detailed rules to separate files (CONVENTIONS.md, ARCHITECTURE.md) and keep your CLAUDE.md around 20-40 lines. The agent reads CLAUDE.md every session. It should find directions, not an encyclopedia.
And delete one thing. One tool. One CLAUDE.md rule. One middleware hook. Pick something you haven't touched in a month. Remove it. Run your normal workflow. If nothing breaks, it had no business being there. And if something breaks, congrats, you just learned what actually matters in your harness. That's worth more than any architecture diagram someone posted on X 💀
Harness engineering is trending. Give it six months. There will be courses. Certifications. GitHub repos with 41 skill definitions, 11 sub-agents, and a README longer than most codebases. Three thousand stars, zero production deployments.
Meanwhile, the builders who actually ship agents will keep doing what they were doing before the word existed. Writing clear instructions. Choosing simple tools. Deleting what stopped working.
The harness isn't a new job. It's the same job with a better name.
Sources:
- Anthropic: Effective Harnesses for Long-Running Agents
- OpenAI: Harness Engineering: Leveraging Codex in an Agent-First World
- LangChain: Improving Deep Agents with Harness Engineering
- Mitchell Hashimoto: My AI Adoption Journey
- Martin Fowler: Harness Engineering
- Manus context engineering: Lance Martin's discussion with Peak Ji
- Phil Schmid: MCP CLI token reduction
- CORE-Bench results: Sayash Kapoor on CORE-Bench being solved
- SWE-bench Pro results: morphllm.com/swe-bench-pro
If these patterns save you debugging time or token costs, I write about this stuff regularly. Prompt contracts, agent architecture, the boring infrastructure decisions that make AI agents work in production instead of just in demos. Follow along if you want the field notes, not the hype.
AI made that cover image. I allocate my pixel budget exclusively to terminal windows.