The Part Your AI Agent Can't Handle Is the Business You Haven't Built Yet.

IKEA optimized for the 47% their chatbot handled. The €1.3B was hiding in the 53% it couldn't.


IKEA's chatbot created jobs.

Not by doing more, but by revealing what it couldn't do.

Billie, IKEA's customer service agent, has been resolving 47% of incoming requests since 2021 without human intervention. 3.2 million interactions, €13M in savings, and every business transformation deck in Europe has a slide about it now. The 47% figure has been copy-pasted into so many pitch decks it's basically the "it works on my machine" of AI business cases: everyone references it, nobody checks what's running underneath.

Nobody covered the other column.

The 53% Billie couldn't handle fed a reskilling program: 8,500 call center workers retrained as remote interior design advisors. The channel generated €1.3B in FY22. To be accurate about what that means: the figure covers Ingka's entire remote selling operation, and part of that revenue existed before the reskilling program started. The causal link isn't perfectly clean. But the order of magnitude holds, and the strategic insight doesn't need the causality to be perfect. This isn't an HR story. It's a product signal everyone read backwards.

Figure: Your AI's success metrics are someone else's business goldmine. (Split-screen illustration: on the left, an anxious office worker celebrates AI metrics on dashboards; on the right, a confident professional studies error logs and business patterns with a magnifying glass.)

The Ratio Nobody Reported

The business press covered the IKEA case and immediately framed it as a workforce transformation story. "AI creates jobs, not destroys them." Nice headline. True, even. But it buried the actual insight under a layer of reassuring narrative mostly designed to make the C-suite feel better about the AI budget and the workforce optics at the same time.

IKEA didn't find €1.3B by building a better chatbot. They found it by asking what the chatbot couldn't do and treating that answer as a business opportunity. The 53% wasn't a failure metric. It was a demand map nobody had thought to read.

Every time Billie handed off a conversation, it was signaling a customer need the product couldn't serve yet. 53% of customer interactions were unsatisfied demand sitting in the log file. Unsatisfied demand isn't a problem to route around; it's a product gap with a price tag. And the longer you optimize the 47%, the more invisible the 53% becomes, because the dashboard makes it look like the system is working fine.

You're Reading the Wrong Dashboard

Every AI agent team I know watches deflection rate. It's the obvious metric: how many requests does the agent resolve without escalating? Higher is better. You optimize it, report it, put it in the monthly review deck. The curve goes up, everyone feels good about the quarter.

The problem isn't that deflection rate is useless. It's structurally blind to what the users actually wanted in the conversations the agent abandoned. The metric measures what succeeded. It tells you nothing about the shape of what failed.
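To make that blindness concrete, here's a minimal sketch. Everything in it is invented for illustration: the event shape, the `intent` labels, and the two sample weeks are assumptions, not data from IKEA or from any real agent. It shows two weeks with an identical deflection rate and completely different failure shapes.

```python
from collections import Counter

def deflection_rate(events):
    """Share of requests the agent resolved without escalating."""
    resolved = sum(1 for e in events if e["resolved"])
    return resolved / len(events)

def failure_shape(events):
    """Count unresolved requests by intent: the part deflection rate ignores."""
    return Counter(e["intent"] for e in events if not e["resolved"])

# Two hypothetical weeks of 100 requests each (invented data).
# Week A: every failure is the same unmet need.
week_a = (
    [{"intent": "order_status", "resolved": True}] * 47
    + [{"intent": "design_advice", "resolved": False}] * 53
)
# Week B: every failure is a different one-off request.
week_b = (
    [{"intent": "order_status", "resolved": True}] * 47
    + [{"intent": f"misc_{i}", "resolved": False} for i in range(53)]
)

# Same number on the dashboard for both weeks.
print(deflection_rate(week_a), deflection_rate(week_b))

# Very different stories underneath.
print(failure_shape(week_a).most_common(1))
print(max(failure_shape(week_b).values()))
```

Week A is one concentrated gap, week B is scattered noise, and the dashboard can't tell them apart.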

Think of Google Analytics. Page views tell you what's working. Exit pages tell you where your product is broken. Exit pages are almost always more valuable for deciding what to build next. (Happened to me in GA4 once. 6 months building features based on what people looked at the longest, which turned out to be the pricing page because they were confused, not interested. Textbook side quest. I was farming the wrong mob entirely. That stung.) You don't fix a 38% exit rate on your checkout page by congratulating yourself on the 62% that made it through. You look at where people left and ask why.

Agent logs work the same way. You're optimizing the 47%. You're not reading the 53%.

One more reason to start: if your agents run on a CLI-native architecture, those logs come out structured. Not a soup of conversational text but actual machine-readable output you can pipe into an audit prompt without spending an afternoon cleaning data first. I've made the full argument for why CLI-native agents produce more exploitable logs elsewhere. It's worth reading before you try this audit on messy logs and wonder why the clustering comes back useless.

Your Failure Log Is a Demand Map

Not everything your agent refuses is an opportunity. Get that out of the way first. Deliberate blocks, third-party auth walls, destructive operations, things that are out of scope by policy: those are working correctly. They're not product gaps. The filter is simple: frequency combined with reformulation equals a real signal. A single occurrence is probably noise.
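As a rough sketch, the filter could look like this. The category labels, the `FailurePattern` fields, and the thresholds are all assumptions made up for the example; tune them to whatever your own triage produces.

```python
from dataclasses import dataclass

# Hypothetical labels for refusals that are working as intended.
BY_DESIGN = {"policy_block", "third_party_auth", "destructive_operation"}

@dataclass
class FailurePattern:
    name: str
    category: str        # hypothetical label assigned during triage
    occurrences: int     # how often the pattern showed up in the window
    reformulations: int  # how many of those involved the user rephrasing

def is_real_signal(p: FailurePattern, min_occurrences: int = 2) -> bool:
    """Frequency combined with reformulation; by-design refusals never count."""
    if p.category in BY_DESIGN:
        return False
    return p.occurrences >= min_occurrences and p.reformulations >= 1

# Illustrative numbers only.
patterns = [
    FailurePattern("third-party auth fallback", "third_party_auth", 112, 0),
    FailurePattern("UI declared done, broken on mobile", "feature_gap", 20, 20),
    FailurePattern("one-off export request", "feature_gap", 1, 0),
]

signals = [p.name for p in patterns if is_real_signal(p)]
print(signals)  # ['UI declared done, broken on mobile']
```

The most frequent pattern gets filtered out because it's a deliberate wall, and the single occurrence gets filtered out as probable noise.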

4 signals worth reading in what your agent abandons, from strongest to most subtle.

Signal 1: Recurring refused questions. When different users hit the same wall with the same type of question across a 30-day window, that wall is a feature request with a sample size attached. You're not looking at 1 confused user. You're looking at demand you haven't served yet.

Signal 2: Repeated in-session reformulations. A user who rephrases the same request 3 or 4 times without getting a useful response isn't confused about how the product works. They're persistent because the need is real and they can't find it anywhere else. Loop count is a measure of how much they want it.

Signal 3: Systematic fallback patterns. Look at which categories of request consistently hit fallback. Categories, not individual requests. "Users want to compare 2 options directly." "Users want a scheduled follow-up." These aren't random failures. They're the outline of what your users actually want versus what you built.

Signal 4: High-effort retries. When a user makes the same request 5 times in a single session with increasing specificity each time, it's not a confused user. It's someone who can't find what they need anywhere else. Retry count is a proxy for willingness to pay. The higher the effort, the more likely you're looking at a premium-tier need that has no home in your current product.
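If your logs are structured enough to give you the user messages per session, the loop counting behind Signals 2 and 4 can be approximated in a few lines. This is a sketch under loud assumptions: word overlap (Jaccard) is a crude stand-in for real semantic similarity, and the 0.5 threshold, the session format, and the sample messages are all invented for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set; deliberately crude, not real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two messages."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def reformulation_count(user_messages: list[str], threshold: float = 0.5) -> int:
    """Count messages that closely echo an earlier message in the same session."""
    count = 0
    for i, message in enumerate(user_messages):
        earlier = user_messages[:i]
        if any(overlap(message, prev) >= threshold for prev in earlier):
            count += 1
    return count

def looping_sessions(sessions: dict[str, list[str]], min_loops: int = 2) -> dict[str, int]:
    """Sessions where the user rephrased a request at least min_loops times."""
    counts = {sid: reformulation_count(msgs) for sid, msgs in sessions.items()}
    return {sid: n for sid, n in counts.items() if n >= min_loops}

# Hypothetical sessions: session id -> user messages in order.
sessions = {
    "s1": [
        "compare plan a and plan b",
        "compare plan a and plan b side by side",
        "show plan a and plan b side by side",
        "thanks anyway",
    ],
    "s2": ["where is my order", "thanks"],
}

print(looping_sessions(sessions))  # {'s1': 2}
```

Word overlap will miss rephrasings that swap vocabulary entirely, which is one reason to let an LLM do the clustering and keep a script like this as a sanity check on its counts.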

1 caveat I actually mean: the signals are real. Your interpretation of them might not be. Run the audit, but don't build directly from the output without at least some informal validation first. A 30-minute call with 2 users beats a high-confidence AI clustering every time.

The failure log is the only product spec your users actually wrote.

What 3.5 GB of Logs Told Me

I ran the audit on 3.5 GB of Claude Code transcripts: 2,801 files, 30 days, 101 projects.

The most countable pattern was exactly what the dashboard had been telling me: 112 auth fallbacks on third-party panels. Visible, expected, and completely unactionable because the fix depends on vendors who have no reason to care about my use case. I've known about those for months.

That was my 47%.

What was hiding in the other column looked like this. Around 20 raw occurrences of a reformulation loop, which sounds like noise until you look at what's inside those sessions. "Pareil" (French for "same") or "still broken" came back 2 to 5 times per session, with the same sequence playing out each time: the agent delivered a UI feature and declared it done, I tested manually on mobile, something broke, and the agent restarted from a slightly different angle the next turn. It still couldn't register that "done" meant working on all surfaces and not just in the sandbox it had access to during the session. By the 4th loop I was burning time I hadn't planned for on a problem I had technically "resolved" twice already. It felt like a Dark Souls run where someone had patched out the bonfire and I was still trying the same dodge pattern on the same boss. (Dark Souls at least tells you "YOU DIED" so you know when to stop.)

The failure log was encoding something specific: I needed a blocking hook that refuses to close a turn without real end-to-end proof, a test and not a declaration, desktop and mobile. 1 day of work to build, and it was hidden in 20 lines of log that no one was reading.
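The core of such a gate is small. The sketch below is only the decision logic, with hypothetical names throughout (`ProofResult`, `REQUIRED_SURFACES`, `can_close_turn`); it is not Claude Code's hook API, and how you wire it into your agent's turn lifecycle depends on your setup.

```python
from dataclasses import dataclass

# Assumption: "done" means a passing end-to-end test on each of these surfaces.
REQUIRED_SURFACES = ("desktop", "mobile")

@dataclass
class ProofResult:
    surface: str  # which surface the end-to-end test ran on
    passed: bool  # result of an actual test run, not the agent's own claim

def missing_proof(results, required=REQUIRED_SURFACES):
    """Required surfaces that have no passing end-to-end test."""
    passed = {r.surface for r in results if r.passed}
    return [s for s in required if s not in passed]

def can_close_turn(results, required=REQUIRED_SURFACES):
    """The turn may close only when every required surface has passing proof."""
    return not missing_proof(results, required)

# The agent declares the feature done with only a desktop run as evidence.
results = [ProofResult("desktop", True)]
print(can_close_turn(results))   # False
print(missing_proof(results))    # ['mobile']

# After a passing mobile run, the gate opens.
results.append(ProofResult("mobile", True))
print(can_close_turn(results))   # True
```

The point of the design is that the gate consumes test results, never a status message: a declaration of "done" has no way to satisfy it.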

Then 7 occurrences of something qualitatively different: content that smelled like AI, the kind of generic taglines and off-persona copy that lands on a lifestyle site and immediately signals that a human didn't write it. (I can always tell when it happens. There's a specific flatness to it, like the model reached for the first acceptable phrasing and stopped there.) This wasn't external user demand. It was an internal spec gap: no content quality gate with explicit anti-patterns and persona constraints per site. The failure log didn't surface a user feature request. It surfaced an absence I had never formalized. Different mechanism from the loop pattern, same source.

The dashboard was counting 112 auth fallbacks. The actual cost was in the loops and the 7 off-tone deliveries. And if 7 out of 2,800 files sounds too small to care about: each one cost me a full manual rewrite session.

Run the 53% Audit on Your Logs

3 prompts. Copy-paste into Claude with your agent logs. Run them in sequence, or just start with Prompt 1 if you're short on time.

Prompt 1: Cluster refused requests and detect reformulation loops (Signals 1+2)

Analyze these agent logs and do the following:

1. Extract all requests the agent refused, couldn't handle, or handed off to fallback.
2. Cluster them by theme. Name each cluster in plain language.
3. For each cluster: count frequency, flag any session where the same request
   was reformulated 2+ times without success.
4. Classify each cluster: feature gap | bug | out of scope by design | unclear.
5. Estimate potential value: low | medium | high (based on frequency and retry effort).

For each cluster, output: name, frequency, reformulation sessions, category, potential value.

Close with 1 sentence: "If you had to build 1 thing from this output, it would be ___."

Prompt 2: Isolate reformulation loops (Signal 2 deep dive)

From these agent logs, find all sessions where the user reformulated the same 
request 2 or more times without a successful response.

For each session:
- What was the original request?
- How many times did the user rephrase?
- What was the final outcome (fallback / partial / abandoned)?
- Does this pattern repeat across multiple sessions, or is it a one-off?

Group sessions by underlying need, not exact phrasing. Separate genuine 
recurring demand from one-time friction.

Prompt 3: Score by frequency x effort (Signals 3+4)

From these agent logs, score each failure pattern:
Frequency x User effort = Priority score

User effort = number of retries + reformulation count per session.

Rank all patterns by priority score, highest first.
For the top 5: what would it take to handle this request successfully? 
1 sentence per pattern.

Flag any pattern where user effort is high but frequency is low.
Those are your high-value outliers.
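If you'd rather check Prompt 3's arithmetic yourself than trust the model's math, the score is easy to reproduce. This sketch assumes each failed session has already been tagged with a pattern name, takes "user effort" as the mean of retries plus reformulations per session, and uses made-up outlier thresholds; all of those are my assumptions, as is the sample data.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Session:
    pattern: str         # failure pattern this session was tagged with
    retries: int         # times the user repeated the request
    reformulations: int  # times the user rephrased it

def score_patterns(sessions):
    """Priority = frequency x mean user effort, highest first."""
    efforts = {}
    for s in sessions:
        efforts.setdefault(s.pattern, []).append(s.retries + s.reformulations)
    rows = []
    for name, values in efforts.items():
        frequency = len(values)
        effort = mean(values)
        rows.append({
            "pattern": name,
            "frequency": frequency,
            "effort": effort,
            "priority": frequency * effort,
        })
    return sorted(rows, key=lambda r: r["priority"], reverse=True)

def high_value_outliers(rows, max_frequency=2, min_effort=4):
    """Patterns with high user effort but low frequency."""
    return [
        r["pattern"] for r in rows
        if r["frequency"] <= max_frequency and r["effort"] >= min_effort
    ]

# Invented sample: 10 low-effort auth failures, 4 mid-effort loops, 1 stubborn user.
sessions = (
    [Session("auth_wall", 1, 0)] * 10
    + [Session("mobile_ui_loop", 2, 2)] * 4
    + [Session("scheduled_follow_up", 3, 3)]
)

rows = score_patterns(sessions)
print([(r["pattern"], r["priority"]) for r in rows])
print(high_value_outliers(rows))  # ['scheduled_follow_up']
```

In this sample the most frequent pattern doesn't rank first, and the single high-effort session never reaches the top of the ranking at all, which is why the outlier flag exists as a separate pass.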

Prompt 2 specifically surfaces something worth reading with a wider frame: repeated session loops don't just mean "the agent failed repeatedly," they signal what your agent is being asked to become. I've unpacked what repeated session loops signal about your agent in detail elsewhere, and the framing applies directly here.

2 Reads, Same Logs

IKEA didn't build a better chatbot. They read the abandonment queue and asked what it was telling them about the business they hadn't built yet.

You have the same logs. You're probably not reading them (maybe I'm wrong about that, maybe your team already runs this audit every sprint and the failure queue is empty and you're building precisely what users can't find anywhere else). In 3 years of running agents across different products and pipelines, I've never seen that setup. What I see is a deflection rate on a dashboard and a log folder nobody opens.

Paste your logs into Prompt 1. Read the potential value column. Check whether what surfaces is something you've known about for months but never built.

If the answer is yes, that's your €1.3B hiding in plain sight.
