Prompt Contracts Framework: When AI Code is Perfect But Wrong 2026

The contract said return a JSON array of contacts. I hadn't set a limit or pagination. The AI-generated code passed every test. Then a user pulled a report on a large segment and the browser died under 5,000 contacts dumped in one blob.

The code was correct. The contract was the bug.

I published an article on prompt contracts. It got shared enough that turning it into a book made sense. And while writing it, three things surfaced that a single article couldn't cover. The contract that breaks while the code is perfect. The border between reliable code and reliable facts. And the instinct that atrophies when you delegate too long.

TL;DR: The article said "specify before you generate." Still true. The book adds three corrections: the contract itself can be the bug, the contract produces reliable code but not reliable facts, and delegating without thinking kills the judgment you need to write good contracts. The feeling test (imagine the worst possible output: if your gut clenches, write a contract) bridges the framework and common sense.

[Image: tech professional debugging a complex contract management system, looking frustrated. Caption: When your epic framework decides to become a plot twist 🤦‍♂️]

5,000 Contacts, Zero Errors, One Crashed Browser

I built the thing in a weekend. Small tool to sync distributor contacts for a WooCommerce store, full backend AI-generated from one prompt contract. Data model, API endpoints, validation, all of it specified upfront. Including this line: "return a JSON array of contacts matching the filter criteria."

Every acceptance criterion passed. I deployed on a Friday afternoon before taking the kids snorkeling (because apparently I hate peaceful weekends).

Monday. A user with 5,000 contacts in their distributor segment hits "Export." Browser goes white. Tab crashes. No error, no graceful degradation. Just a dead tab and a JSON blob the size of a short novel sitting in memory.

The AI didn't deviate from the contract. Not by a single line. It did exactly what I asked, which is exactly why it broke.

Chapter 4 of the book sits with that question: did the AI break the contract, or did I write a contract that specified the wrong behavior?

Three patterns of contract failure came out of that chapter. Missing edge cases: the contract said "return contacts" but never said what happens when there are 5,000 of them (no pagination, no cap, and the absence of a constraint is itself a bad specification). Wrong domain assumptions: I assumed contacts would number in the tens, maybe hundreds, but that assumption lived in my head, not in the document. And over-specification of the cosmetic stuff while the structural boundaries were nowhere: I had detailed the exact JSON nesting, the field names, the types. But max payload? Fallback when it's too large? Nothing.
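As a rough illustration of the constraint the contract never stated, here is a minimal Python sketch of a bounded response. The function name, the `Page` shape, and both size limits are hypothetical (the original tool and its numbers are not shown here); the point is only that the cap lives in code because it was first written into the spec.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical limits: placeholders, not values from the original tool.
DEFAULT_PAGE_SIZE = 100
MAX_PAGE_SIZE = 500


@dataclass
class Page:
    items: list
    total: int
    next_offset: Optional[int]  # None means there is no further page


def paginate_contacts(contacts: list, offset: int = 0,
                      limit: int = DEFAULT_PAGE_SIZE) -> Page:
    """Return one bounded slice of contacts instead of the whole list."""
    if offset < 0:
        raise ValueError("offset must be >= 0")
    # Clamp the requested size so no caller can ask for an unbounded payload.
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    items: list[Any] = contacts[offset:offset + limit]
    end = offset + len(items)
    next_offset = end if end < len(contacts) else None
    return Page(items=items, total=len(contacts), next_offset=next_offset)
```

With a 5,000-contact segment, a request for 10,000 rows would come back as 500 items plus a `next_offset`, so the client pulls the rest page by page instead of holding one blob in memory.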

The fix wasn't better prompting. I added three questions to the workflow:

Here is my contract for [feature].
Before generating any code:
- What edge cases am I not covering?
- What assumptions am I making that aren't written down?
- Where does this break at scale?

Three questions. That's it. The AI tears apart your spec before it writes a line. You patch the holes, then you generate.
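If you want the review step to be hard to skip, the template above can be wrapped in a small helper. This is a sketch under assumptions: `build_review_prompt` is a hypothetical name, and how the resulting string reaches your AI tool is left out entirely.

```python
REVIEW_QUESTIONS = (
    "What edge cases am I not covering?",
    "What assumptions am I making that aren't written down?",
    "Where does this break at scale?",
)


def build_review_prompt(feature: str, contract: str) -> str:
    """Wrap a contract in the three review questions, asked before any code exists."""
    questions = "\n".join(f"- {q}" for q in REVIEW_QUESTIONS)
    return (
        f"Here is my contract for {feature}.\n\n"
        f"{contract.strip()}\n\n"
        "Before generating any code:\n"
        f"{questions}"
    )
```

Calling `build_review_prompt("contact export", contract_text)` gives you the same three questions every time, which matters on the days you are tempted to skip straight to generation.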

That's the loop the book documents. Not prompt, generate, fix the code, reprompt (the naïve cycle that keeps the spec frozen while you patch symptoms). The real cycle: specify, generate, verify, revise THE SPEC. The spec is the living document. The code is the output. When the output breaks, you fix the input.

[Figure: the real prompt contract cycle, contrasted with the naïve loop.]

The first version of any contract is a draft. Always. The feeling test helps decide if you need a contract at all (more on that later). But once you decide you need one, assume it has holes. The loop is the product, not the document.

The article said: write a contract. The book had to answer: and when the contract itself is the bug?

A broken contract, you can fix (patch the spec, regenerate). The next problem is sneakier, because the contract cannot fix it at all.

The Framework Has a Border the Article Never Drew

Prompt contracts produce reliable code. They do not produce reliable facts.

That sentence is nowhere in the original article. It couldn't be. In a single article you show the framework working, you don't go mapping where it stops. A book doesn't let you get away with that.

A Stripe webhook either validates the signature or it doesn't. A database schema either respects the index or it doesn't. For code logic, the contract eliminates hallucination. The AI generates what you specified, the tests confirm it, done.

But my WooCommerce tool also generates reports with real distributor data. Contact URLs, audience numbers, partner names. Things that exist in the world, not in the contract. And here the contract brought the hallucination rate down from 30% to about 15%. Down, not gone.

15% sounds small until you meet Karen from Accounting. Every pipeline has one. She's the person at the other end who doesn't care about your clean architecture. Karen cares that row 47 of the quarterly report lists a distributor that does not exist. You can explain that the code is technically correct. Karen will explain that the client called, and that your technically correct report made the company look like amateurs. Karen always wins these arguments. 🤷

The METR study found a 39-44% gap between how productive developers think they are with AI and how productive they actually are. Ox Security went further: 10 recurring anti-patterns in 80-100% of AI-generated code, what they called "an army of talented juniors without oversight." The contract is the oversight for code. For facts the AI pulls from its training data or invents from nothing, the contract has no jurisdiction.

For products where the value is in the code (most SaaS), the framework is enough. For products where the value depends on facts generated by the AI, you need a verification layer on top. The book documents both sides. The article only had room for one.
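One very small form such a verification layer could take is a check of every generated row against a source you trust. The sketch below is an assumption, not the book's method: the `distributor` field name and the idea of a trusted name set (for example, exported from your own database) are hypothetical.

```python
from typing import Iterable


def find_unverified_rows(rows: Iterable[dict], known_names: set,
                         field: str = "distributor") -> list:
    """Return (row_number, value) pairs for rows whose field is not in the trusted set.

    Row numbers are 1-based so they match what a reader sees in the report.
    """
    # Compare case-insensitively and ignore surrounding whitespace.
    normalized = {name.strip().casefold() for name in known_names}
    flagged = []
    for row_number, row in enumerate(rows, start=1):
        value = str(row.get(field, "")).strip()
        if value.casefold() not in normalized:
            flagged.append((row_number, value))
    return flagged
```

Anything this returns gets a human look before the report goes out. It only catches names that do not exist in your own records; a real distributor paired with a wrong audience number would still slip through, so treat it as a floor, not a guarantee.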

And this week makes the border more visible. 1,234 community skills for Claude Code. Nine categories, from deployment to testing to documentation. Every single one automates generation. How many automate verification of factual outputs? I scrolled through the top 10. Zero.

1,234 skills to generate code. The one to verify facts doesn't exist yet.

What follows is neither technical nor factual. It's personal, and it was the hardest chapter to write.

The Skill the Book Taught Me That No Slash Command Can Replace

I wrote this contract three months ago for a partner CSV import:

feature: partner-csv-import
acceptance_criteria:
  - Parse CSV with headers: name, url, category, audience_size
  - Validate each row: name non-empty, url valid format,
    audience_size integer > 0
  - Skip malformed rows, log them to stderr
  - Output clean JSON array to stdout
edge_cases:
  - Empty CSV → exit 0, empty array
  - CSV > 50,000 rows → stream processing, never load full file
  - Duplicate URLs → keep first occurrence, log duplicates
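
For a sense of what code satisfying this contract might look like, here is a minimal Python sketch. It is not the generated code from the project. Two things are assumptions: "url valid format" is read as "http or https with a host", and the process exit code is left to the caller. The function names are hypothetical.

```python
import csv
import json
import sys
from typing import Optional, TextIO
from urllib.parse import urlsplit


def clean_row(row: dict) -> Optional[dict]:
    """Return a normalized row, or None if it violates the contract."""
    name = (row.get("name") or "").strip()
    url = (row.get("url") or "").strip()
    try:
        parts = urlsplit(url)
        audience = int((row.get("audience_size") or "").strip())
    except ValueError:
        return None
    # Assumption: "valid format" means an http(s) URL with a host.
    if not name or parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    if audience <= 0:
        return None
    return {"name": name, "url": url,
            "category": (row.get("category") or "").strip(),
            "audience_size": audience}


def import_partners(src: TextIO, out: TextIO = sys.stdout,
                    log: TextIO = sys.stderr) -> int:
    """Stream rows from src to out as a JSON array; return how many rows were written."""
    reader = csv.DictReader(src)  # reads row by row, never the full file
    seen_urls = set()  # grows with unique URLs, not with file size
    written = 0
    out.write("[")
    for row in reader:
        cleaned = clean_row(row)
        if cleaned is None:
            print(f"line {reader.line_num}: malformed row skipped", file=log)
            continue
        if cleaned["url"] in seen_urls:
            print(f"line {reader.line_num}: duplicate url {cleaned['url']}", file=log)
            continue
        seen_urls.add(cleaned["url"])
        out.write(("," if written else "") + json.dumps(cleaned))
        written += 1
    out.write("]\n")  # an empty CSV still produces a valid empty array
    return written
```

Each line of the contract maps to a visible branch: the two log statements, the first-occurrence rule, the empty-array case. That mapping is what makes the output checkable against the spec.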

And here's what I generated last week for a throwaway script to rename some image files:

rename all .png files in /uploads to kebab-case, lowercase

No contract. No acceptance criteria. No edge cases. One line.
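
For contrast, a throwaway script for that one-liner could be as short as the sketch below. It is an illustration, not the script that was actually generated; the kebab-case rules (split camelCase, collapse anything non-alphanumeric into a hyphen) and the choice to skip rather than overwrite are assumptions.

```python
import re
from pathlib import Path


def to_kebab_case(stem: str) -> str:
    """Lowercase a file stem and join its words with single hyphens."""
    # Split camelCase boundaries, then collapse every non-alphanumeric run.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", stem)
    return re.sub(r"[^a-z0-9]+", "-", spaced.lower()).strip("-")


def rename_pngs(folder: Path) -> list:
    """Rename .png files in folder to kebab-case; return (old, new) name pairs."""
    renamed = []
    for path in sorted(folder.glob("*.png")):
        new_stem = to_kebab_case(path.stem)
        target = path.with_name(new_stem + ".png")
        # Skip names that are already fine, would be empty, or would overwrite a file.
        if not new_stem or target == path or target.exists():
            continue
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

There is no streaming, no logging contract, and no edge-case list here, and that is fine: the worst outcome is a badly named file and a re-run.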

The difference is the feeling test. The CSV import touches partner data that feeds into production reports. If the parser silently drops rows or corrupts a URL, real people make real decisions on wrong data. My gut clenches. Contract. The image renaming? If it botches a filename I just re-run it. Shrug. Vibe code.

The original article had a simple pitch: write a contract, let the AI code. The book had to add the uncomfortable second part: but keep your instinct alive.

I worked at a French bank early in my career. They had COBOL batch jobs running in production for 25 years. Nobody touched them. Nobody needed to. The whole mise en place worked so well that understanding the system became optional. Until a regulatory change forced a modification, and the developers who understood the system had retired. Gone. Not because anyone made a mistake. Because the system was too reliable for too long.

Luciano Nooijen told MIT Technology Review that his coding instincts degraded after months of intensive AI usage. Same mechanism, just compressed from decades to months. Craig Weiss put it differently this week (and the dev community agreed loudly enough to notice): the real moat is system design and product thinking, not syntax. The syntax is delegated. The judgment cannot be.

The funny part (or terrifying, depending on which day you ask me):

We went from reading every line of code to not reading it at all. And the replacement for "I verified this line by line" is not some superior automated verification system. No test suite catches everything. We run on a feeling now. That thing will blow up. That one's fine. This one, careful. We traded code review for gut check. The most advanced development workflow in history runs on your stomach. 🫠

The feeling test makes it a method instead of a guess. Imagine the worst possible output for the task you're about to delegate. Database corrupted. Payment charged twice. User data exposed in the logs. Gut clenches: contract. Shrug: vibe code freely.

I built the original prompt contracts framework after enough disasters like the 5,000-contact crash. The article captured the direction. The book maps the terrain, dead ends included.

The article taught a framework. The book taught me when not to use it.

What the Article Couldn't Cover

A 10-minute read proves a direction. "Specification beats improvisation." That part is settled.

But a 10-minute read doesn't let you sit with the failure modes. The contract that is the bug. The border where facts start and the framework stops. The slow erosion of your own judgment. I wrote the book to go deeper into all of that, for the devs who hit the same walls and want more than a compass bearing.

The book is called Prompt Contracts: How I Stopped Vibe Coding and Started Shipping Real Software With AI. It's on Amazon now.

The industry will stack skills, layers, abstractions. This week it's 1,234. Next month it will be 50,000. Each new layer makes it more comfortable to never look at what's happening underneath.

The book doesn't say "stop stacking." It says: know where the cracks are in the foundation. The wise man doesn't build his house on sand. The contract can be the bug. The contract has a border. And your instinct has an expiration date if you don't use it.

Feeling + method. Ship it.

Sources

METR, study on AI-assisted development productivity (39-44% perception gap).
Ox Security, "Top 10 AI Code Generation Risks" (anti-patterns in 80-100% of AI code).
MIT Technology Review, Luciano Nooijen on coding instinct atrophy.

If you build with AI and want the honest version (gaps included, not just the highlight reel), follow along. Next one lands in your inbox.

(*) The cover is AI-generated. The three gaps in the article are certified organic, farm-to-table human mistakes.
