He Pays Claude $40K/Month to Replace His Team. Half His Day Is Spent Debugging It.

Andrew Wilkinson runs his family office on Claude. The math he gave on stage: 50% debugging, 30% setup, 20% real output.

The morning ritual of the vibe-business-coder. You open your MacBook and you don't ask what your agents produced overnight 🤓. You ask what they broke. You walk the floor. Logs, drifts, hotfixes. You're not running an autonomous company, you're pulling extended guard duty.

TLDR: Andrew Wilkinson runs a $400M holding company and pays a $40K Claude bill every month to replace headcount. He calls it an autonomous company. On stage, he also gave the ratio nobody else volunteers: 50% debugging, 30% setup, 20% real output. The math works at his scale. The word doesn't.

Image: an exhausted developer at a desk surrounded by error logs and red warnings, beside a whiteboard reading 'CONTEXT WINDOW = AMNESIA'. Caption: Paying $40K/month to replace your team? Debugging is the real job now.

This week, Andrew Wilkinson sat down on Greg Isenberg's podcast (56K views in 24 hours) and said he runs his family office on a $40K-a-month Claude bill. Then he gave the honest ratio (rare in this corner of the internet): 50% debugging, 30% setup improvement, 20% real output. Andrew is the most convinced man in the game. He vibe-codes Deep Personality, a SaaS at around $20K in revenue. His CFO, who has zero coding background, built a replacement for Addepar (a wealth platform priced at $50K to $100K a year) in roughly two weeks. Let's see what we can actually learn from this pro.

What Andrew Actually Said on Stage

Tiny is not a side project. Andrew runs a holding company with a portfolio over $400M and 24 companies under it. He is not a skeptic looking for a viral take. He is the guy buying the most Claude credits on the West Coast and telling the camera it works.

The numbers, in his own words on the show.

His family office swapped headcount for a Claude bill. The bill is around $40,000 a month. The work the bill replaces would have been done by a small ops team a year ago. He calls this an autonomous company. He says it without irony.

Deep Personality is the consumer SaaS he keeps as a vibe-coding playground. About $20K of revenue. Built and maintained mostly by his agents. He admits, on the same podcast, that debugging eats half of his day on this product alone.

The Addepar replacement is the most striking story. His CFO, who never wrote production code in his life, vibe-coded a tool that replaces a wealth-management platform priced between $50K and $100K a year per seat. Two weeks. A non-engineer. Replacing a multi-million-dollar enterprise SaaS at his scale.

And in the same breath, the ratio. Half debugging. Thirty percent improving the setup itself, the prompts, the harnesses, the context files. Twenty percent actual output the business sees.

Two truths sit in that interview, and they don't cancel each other out. Andrew's agents ship real outcomes that justify the bill. And Andrew spends half of every day playing nurse to those agents. The first truth is what gets clipped. The second one is what makes the first one possible. The X bubble keeps the first part and quietly drops the second.

A reminder before we go further. Andrew's math works because of his scale. A solo builder at $20K of monthly revenue cannot afford a $40K Claude bill plus 50% of his day in supervision. Andrew can. The math doesn't generalize down. We'll come back to that.

"Autonomous" Is the Most Dishonest Word in AI Right Now

Autonomous should mean runs without intervention. Open a dictionary. That's the whole job of the word.

What Andrew described, what every operator I know running agents in production lives through, is something else. The agent ships. Then the operator audits. The operator fixes. The operator rebuilds the morning context. The operator briefs again. The agent ships again. Repeat.

That is supervised work with a fashionable label. We just stopped using the word "supervised" because it kills the pitch.

Andrew himself is honest about the ratio. He gave the number on stage. The dishonesty is downstream, in the X clips that quote his shipping wins and crop out his debugging hours. The dishonesty is in the dozens of "I built an autonomous company over the weekend" posts that don't include the part where the founder spent his Sunday rolling back six commits the agent shipped while he slept.

If we want the word to mean anything, somebody has to explain why the 50% exists. Otherwise we're just selling a polished version of "I have a junior who needs constant hand-holding, but he scales."

The Forgetting Problem

Andrew said 50% debugging. He didn't say why. Here is the most likely reading, and it's mine, not his.

The agent doesn't remember your company. The agent doesn't even remember yesterday.

A context window is a finite room. Most of today's frontier models work with a few hundred thousand tokens, and the largest advertise a million or two. That sounds like a lot until you try to fit a whole business in there. Your codebase. Your naming conventions. The decision you made on Tuesday about the new endpoint. The Slack thread where your CFO said the invoicing flow needed a fallback for partial refunds. The CSV layout your distributor sends every Monday at 4am. Multiply by every tool, every integration, every quirky business rule.

Doesn't fit. Not even close.

So every morning, you don't wake up your agent. You re-onboard it. You replay the relevant parts of the company's brain into its context. You fix the things it forgot. You discover the things it half-remembered and got slightly wrong. That re-onboarding cost is the 50%. It's not a bug in the prompts, not a bad harness. It's the memory shape of the underlying model.
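To put rough numbers on the finite room, here is a back-of-the-envelope sketch. Every figure in it is my own assumption for illustration: the 4-characters-per-token rule of thumb, the 200,000-token window, and the size of each artifact. None of it comes from Andrew, and a real tokenizer will give different counts.

```python
# Rough sketch: does a company's working knowledge fit in one context window?
# All sizes below are hypothetical. ~4 characters per token is a common rule
# of thumb for English text, not an exact tokenizer count.

CHARS_PER_TOKEN = 4
CONTEXT_WINDOW_TOKENS = 200_000  # assumed window size

# Hypothetical artifact sizes, in characters.
ARTIFACTS = {
    "codebase": 6_000_000,
    "naming_conventions": 20_000,
    "decision_log": 150_000,
    "slack_threads": 2_000_000,
    "integration_docs": 400_000,
}


def estimate_tokens(chars: int) -> int:
    """Convert a character count to an approximate token count."""
    return chars // CHARS_PER_TOKEN


def context_overflow(artifacts: dict[str, int], window: int) -> float:
    """Return how many context windows the artifacts would fill."""
    total = sum(estimate_tokens(size) for size in artifacts.values())
    return total / window


ratio = context_overflow(ARTIFACTS, CONTEXT_WINDOW_TOKENS)
print(f"Needs about {ratio:.1f}x the window")  # Needs about 10.7x the window
```

With these made-up sizes the company needs roughly ten windows, which is why something has to be selected and replayed each morning rather than loaded whole.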

Andrew himself, on the same podcast, names the threshold. He thinks the unlock arrives somewhere around 5 to 10 million tokens of usable context. The number where a model can hold a whole company in its head at once. Order of magnitude, not benchmark. We're not there yet. Frontier models offer hundreds of thousands of tokens, a million or two at the top end, not five to ten, and the quality of recall degrades long before the limit.

Until that gap closes, every "autonomous" agent is a brilliant amnesiac. It can do real work. It just can't keep doing it without you sitting next to it, refreshing its memory of what it did yesterday and why.

There's a workaround that takes the edge off, and it's the one I shipped after enough of those morning rituals. You encode the context as a spec the agent reads before each task. Not a vibe instruction, a contract. Inputs, outputs, invariants, failure modes, the decisions that already got made. The contract becomes the prosthesis the model is missing. It doesn't fix the amnesia. It compensates for it, the way reading glasses don't fix bad eyes but let you finish the page.
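A minimal sketch of what such a contract could look like in code. The `TaskContract` class, its field names, and the sample values are hypothetical and only illustrate the five parts listed above. This is not the exact format I use, and the partial-refund example is borrowed from the CFO thread mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class TaskContract:
    """A hypothetical spec the agent reads before each task."""
    name: str
    inputs: list[str]
    outputs: list[str]
    invariants: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the contract as plain text to prepend to a prompt."""
        sections = [
            ("Inputs", self.inputs),
            ("Outputs", self.outputs),
            ("Invariants", self.invariants),
            ("Failure modes", self.failure_modes),
            ("Decisions already made", self.decisions),
        ]
        lines = [f"CONTRACT: {self.name}"]
        for title, items in sections:
            if not items:
                continue  # skip empty sections to save context
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Illustrative values only.
contract = TaskContract(
    name="partial-refund fallback",
    inputs=["order id", "refund amount"],
    outputs=["updated invoice"],
    invariants=["refund never exceeds amount paid"],
    decisions=["invoicing flow needs a fallback for partial refunds"],
)
print(contract.render())
```

The point of rendering to plain text is that the same contract can be prepended to any task prompt, so the decisions that already got made travel with the task instead of living in yesterday's context.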

That prosthesis is necessary today. Until the context window absorbs a whole company at once, the workaround stays.

What $40K/Month Actually Buys You

A $40K monthly Claude bill is not a headcount replacement. That framing is the trap.

What Andrew actually bought is a relocation of work. The agents do the execution. Andrew does the supervision. Before, he paid people to do execution and other people to manage them. Now he pays Claude to do execution and pays himself in supervision time. The total cost of the operation is the bill plus 50% of his attention, not just the bill.
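The "bill plus attention" framing can be written as a one-line cost model. This is a sketch under loud assumptions: the 160 working hours per month and the $500 hourly value of the operator's time are numbers I made up to show the shape of the calculation. Only the $40K bill and the 50% share come from the interview.

```python
def total_monthly_cost(bill: float, supervision_share: float,
                       hours_per_month: float, hourly_value: float) -> float:
    """Return the bill plus the value of the operator's supervision time."""
    supervision_cost = supervision_share * hours_per_month * hourly_value
    return bill + supervision_cost


# hours_per_month and hourly_value are hypothetical placeholders.
cost = total_monthly_cost(
    bill=40_000,
    supervision_share=0.5,
    hours_per_month=160,
    hourly_value=500,
)
print(cost)  # 80000.0
```

Under those assumptions the real monthly cost is double the invoice. Change the hourly value and the conclusion moves with it, which is the whole argument about scale.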

For Andrew, the math still wins. He has spent twenty years sitting in hiring panels and Slack DMs and one-on-ones about quarterly performance. His clear, repeated point on the show: the worst part of business is people. He genuinely prefers the trade. He'll babysit ten agents over managing three humans, every day of the week. At his scale, with his fatigue, the swap makes sense.

For a solo builder at $20K monthly revenue, the math inverts. You don't have a $40K cushion. You don't have twenty years of management fatigue to escape from. You're trading salary you can't afford against time you have even less of. The same agentic stack that frees Andrew traps you. Same tools, opposite outcomes. The X bubble flattens that distinction. Andrew is honest about his scale. The clips aren't.

Now here's the part the critics of this whole movement keep skipping. Even with the 50%, the productivity ceiling has moved in a way that should genuinely scare anyone watching from the sidelines.

Speaking from my own bench: I'm shipping a hundred times faster than I used to. A thousand times on the small stuff. I'll spend one day getting an app to 80% (the part that took two months in 2022) and then two days debugging the rest. The math is brutal in both directions. What actually drives me mad is the morning the agent stops mid-task and announces, with full confidence: "I first need to understand the classifier architecture and the WooCommerce sync." Buddy. You wrote that code. Last week. Every single line of it. 🙃

Andrew's CFO story sits in the exact same emotional register, scaled up by an order of magnitude. A non-engineer rebuilt Addepar in two weeks. A platform that costs $50K to $100K a year. In 2022, the fastest consulting firm on the planet could not have delivered wealth-management software in two weeks with a non-engineer in the driver's seat. The agents are inefficient at the operator level (50% lost to debugging) and historically efficient at the output level (capabilities that simply were not on the menu eighteen months ago).

That's the part that should keep you up at night. Not whether the agents are autonomous. They aren't. What matters is what a single supervisor running a leaky bucket of brilliant amnesiacs now produces, compared to what a fully staffed team produced three years ago. The delta is brutal. It keeps growing. The 50% inefficiency is the entry fee for sitting at the table where that delta exists.

I wrote elsewhere that I now manage 150 agents the way I used to manage 5 humans, and I'm still in shock at the ratio. The ratio is real. What I want to add today is the part nobody prints: the ratio scales, but the absolute babysitting time scales with it. Manage 5 agents that forget every morning, and you spend an hour a day re-onboarding. Manage 150, and you spend most of your week.

The opportunity is terrifying. So is the cost of staying close enough to the agents to grab it.

The Tell: Even Andrew Briefs His Agents Like Junior Hires

Andrew gave his best prompting tip on the show. I'll concede the point first because the tip is genuinely good.

Before letting the model generate anything, he asks it to interview him. Multiple-choice questions. Five, ten, sometimes twenty. Forced choices on scope, on edge cases, on naming, on what to skip. Only after the questionnaire is done does the model produce the artifact.
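The interview-first flow described above can be sketched as a small gate in front of generation. Everything here is hypothetical scaffolding: the `Question` class, the prompt wording, and the sample questions are mine, not Andrew's, and no model API is called. The sketch only shows the rule that nothing gets generated until every forced choice is answered.

```python
from dataclasses import dataclass


@dataclass
class Question:
    topic: str          # e.g. scope, edge cases, naming, what to skip
    text: str
    choices: list[str]


def interview_prompt(task: str, n_questions: int = 10) -> str:
    """Build the first-turn prompt: ask questions, produce nothing yet."""
    return (
        f"Task: {task}\n"
        f"Before producing anything, interview me with {n_questions} "
        "multiple-choice questions covering scope, edge cases, naming, "
        "and what to skip. Do not generate the artifact yet."
    )


def ready_to_generate(questions: list[Question], answers: dict[str, str]) -> bool:
    """Allow generation only when every question has a valid choice selected."""
    return all(answers.get(q.text) in q.choices for q in questions)


# Illustrative questions and answers.
questions = [
    Question("scope", "Handle partial refunds?", ["yes", "no"]),
    Question("naming", "Module name?", ["billing", "invoicing"]),
]
answers = {"Handle partial refunds?": "yes"}
print(ready_to_generate(questions, answers))  # False

answers["Module name?"] = "invoicing"
print(ready_to_generate(questions, answers))  # True
```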

Adopt it. It's one of the few prompting tricks from the last two years that survives contact with production. It cuts hallucinations. It surfaces decisions you would have made implicitly and gotten wrong. It saves the rollback later.

Now read it again. If your agent needs to interview you in multiple-choice form before every meaningful task, what does that tell you about its level of autonomy?

It's a junior who doesn't have the brief. A smart junior, fast, tireless, never sick. But a junior who walks into your office, asks four questions before lifting a finger, then produces something close to right. That's not delegation. That's pair-programming with the verbosity dialed up. It's the same context burn problem we've been documenting from the other angle: the model can't carry your project in its head, so it has to ask each time.

Andrew found the pragmatic prosthesis. The multiple-choice interview is the prosthesis. He just doesn't name it as one. He calls it a prompting tip. It is. It is also the loudest tell in the entire interview that "autonomous" is the wrong word for what's happening.

He has a full-time amnesiac employee who costs $40K a month and asks to be briefed every morning. He calls it an autonomous company. He is getting more done than any operator at his scale ever managed. Take the trade. Refuse the word.

The word is fake. The receipts are not.
