The 95% Problem

Jun 23, 2026

Companion to The Vibe Coding Trap. That piece argued from first principles that the model is the easy part. This one looks at the same question through the market data — what a run of 2025–2026 research says about why AI pilots stall, and what the ones that work have in common.

Most operators have seen the number: 95% of corporate generative AI pilots produce no measurable business value. It comes from an MIT report published in the summer of 2025 and has been repeated everywhere since; dropping the stat into an AI conversation has become shorthand for being level-headed about the technology. The number itself has been challenged: critics, including Wharton's Kevin Werbach, argue the methodology doesn't support a precise 95%, so take it as directional rather than exact. The pattern underneath is the part that holds.

The Vibe Coding Trap made a related argument from first principles, reasoning from how enterprise software works. The AI model call is about 10% of an enterprise-grade agentic system. The other 90% (integration plumbing, the audit trail, access controls, the eval framework, cost monitoring, the deployment pipeline) decides whether the system holds up once the business depends on it. This piece tests that claim against market data: research from MIT NANDA, Gartner, Forrester, and reporting through 2026 from CIO, TechRadar, ITPro, and others.

The reports converge. Underneath the 95% is a consistent pattern: the failures sit in the engineering around the model, not in the model itself — and the same research that documents the failure describes what the working systems do differently.

1. The finding underneath the headline

The MIT NANDA report, The GenAI Divide: State of AI in Business 2025, is the source of the 95% figure. Two findings past the headline matter more than the number.

The first is why pilots fail. Not the models, infrastructure, talent, or regulation — the gap is learning, integration, and workflow adaptation. The tools that stall can't hold context, don't improve from feedback, and don't bend to how the work runs. MIT frames it as a divide between high adoption and low transformation: near-universal use, almost no P&L impact.

The second finding gets less airtime. Solutions bought from specialized vendors or built through partnerships succeeded about twice as often as systems built internally. The instinct to build it yourself — the one a cheap, working prototype encourages — is among the more reliable predictors of landing in the 95%.

That result lines up with the Vibe Coding Trap's first-principles claim: the work that breaks you is the work around the model, and the team best positioned to do that work is rarely your own, building from scratch.

2. What changed in 2026

Through 2024 and 2025, a failed pilot mostly cost the pilot. In 2026 the pressure changed, on two fronts: returns and cost.

On returns: CIO's reporting on the year describes management teams now looking for returns inside twelve months. It names the failure pattern directly: teams that spread effort across visible use cases instead of asking where AI would make the company measurably better. Boards want returns in dollars, not pilot counts.

On cost: a survey by Asana, covered by ITPro, found that 82% of UK IT leaders hit unplanned AI cost increases over the prior year. Token-based billing made the real cost of agentic systems legible, and companies including Uber began flagging how hard it is to tie token spend to visible improvement. The same impatience reached executives and investors. G-P's 2026 AI at Work Report found nearly 70% of executives ready to cut AI budgets if this year's goals aren't met, and a Teneo survey covered by Axios found 53% of investors expecting AI returns within six months, a timeline only 16% of large-cap CEOs share.

Models improved on price and speed across this period. The new pressure comes from the cost of the surrounding 90%, at companies that scoped and budgeted for the 10%.
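One way to make that cost legible is to track token spend per accepted outcome rather than per call. The sketch below is illustrative only: the per-token prices, the `AgentRun` fields, and the sample runs are hypothetical values chosen to show the arithmetic, not figures from any of the reports cited here.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per one million tokens.
# Real prices vary by model and vendor.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


@dataclass
class AgentRun:
    """One end-to-end agent task, possibly spanning many model calls."""
    input_tokens: int
    output_tokens: int
    succeeded: bool  # did the run produce an outcome the business accepted?


def run_cost(run: AgentRun) -> float:
    """Dollar cost of a single run at the assumed prices."""
    return (run.input_tokens * INPUT_PRICE_PER_M
            + run.output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def cost_per_success(runs: list[AgentRun]) -> float:
    """Total spend divided by accepted outcomes; failed runs still cost money."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")
    return sum(run_cost(r) for r in runs) / successes


# Made-up sample: the failed run retried and consumed twice the tokens.
runs = [
    AgentRun(input_tokens=200_000, output_tokens=20_000, succeeded=True),
    AgentRun(input_tokens=400_000, output_tokens=40_000, succeeded=False),
    AgentRun(input_tokens=200_000, output_tokens=20_000, succeeded=True),
]

average_per_run = sum(run_cost(r) for r in runs) / len(runs)
print(f"average per run:  ${average_per_run:.2f}")
print(f"cost per success: ${cost_per_success(runs):.2f}")
```

At these assumed prices the average run costs $1.20, but each accepted outcome costs $1.80, because the failed run is paid for too. A budget scoped on per-call cost alone would miss that difference.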

3. Where the failures actually land

Across a dozen reports that all say AI is failing, the useful question is where. The Vibe Coding Trap grouped the requirements of an enterprise-grade system into three buckets — the Plumbing (can it connect, securely?), the Brain (does it make good decisions?), and the Operating System (can you run it, trust it, improve it?). The market failures map onto those buckets, and concentrate in the third.

Forrester, covered by ITPro, describes the build/buy decision in operators' terms. It names "platform confusion" — teams frozen between a SaaS agent, a systems-integrator build, and a custom internal build — as a reason initiatives never leave pilot. Forrester's split: teams treating agentic AI as a feature experiment stay in pilots; teams investing in agent-native design, executable governance, and the operational scaffolding around the model are the ones that scale.

TechRadar's June 2026 analysis makes the same point: agents that run in a sandbox fall apart once they touch live systems and face audit and risk scrutiny, and those are not problems a better model solves. Gartner puts a number on the pilot-to-production drop-off. Its widely cited July 2024 forecast held that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, a figure Gartner itself later raised to at least 50%.

Evaluation is the piece teams discover last. Getting a system to produce an answer is the easy part; the production problem is knowing whether the answer is any good, catching it when quality slips, and changing the model, the prompts, or the workflow without quietly breaking what already worked. That is continuous measurement and observability, not a one-time build — the eval framework named in the 90%, and a common place pilots stall.
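A minimal sketch of that kind of regression check, assuming a small hand-labeled golden set and an exact-match scorer. Every name here (`GOLDEN_SET`, `evaluate`, `regression_gate`, and the two stand-in systems) is hypothetical, and production eval frameworks typically use richer scoring than exact match.

```python
from typing import Callable

# Hypothetical golden set: inputs paired with answers a domain expert approved.
GOLDEN_SET = [
    ("invoice 1041 status", "paid"),
    ("invoice 1042 status", "overdue"),
    ("invoice 1043 status", "disputed"),
    ("invoice 1044 status", "paid"),
]

# Any model + prompt + workflow combination, hidden behind one function.
System = Callable[[str], str]


def evaluate(system: System) -> dict[str, bool]:
    """Run every golden case and record pass/fail per input (exact match)."""
    return {q: system(q).strip().lower() == expected for q, expected in GOLDEN_SET}


def regression_gate(
    baseline: System, candidate: System, max_drop: float = 0.0
) -> tuple[bool, list[str]]:
    """Compare a candidate against the current system before shipping a change.

    Returns (ok, regressions). ok is False if the pass rate fell by more than
    max_drop; regressions lists the cases that used to pass and now fail.
    """
    base, cand = evaluate(baseline), evaluate(candidate)
    base_rate = sum(base.values()) / len(base)
    cand_rate = sum(cand.values()) / len(cand)
    regressions = [q for q in base if base[q] and not cand[q]]
    return (base_rate - cand_rate) <= max_drop, regressions


# Stand-ins for two versions of the system; a real one would call a model.
def current_system(q: str) -> str:
    return {"invoice 1041 status": "paid", "invoice 1042 status": "overdue",
            "invoice 1043 status": "open", "invoice 1044 status": "paid"}[q]


def new_prompt_system(q: str) -> str:
    return {"invoice 1041 status": "paid", "invoice 1042 status": "late",
            "invoice 1043 status": "disputed", "invoice 1044 status": "paid"}[q]


ok, regressions = regression_gate(current_system, new_prompt_system)
print(ok, regressions)
```

In this made-up example both versions pass three of four cases, so the aggregate gate passes, yet one case that used to pass now fails. Reporting per-case regressions alongside the headline score is one way to catch a change that quietly breaks what already worked.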

The shorthand across these independent sources: AI fails in the same places vibe-coded prototypes are weakest — integration, learning, governance, and cost control.

4. What the 5% have in common

"Buy beats build by 2x" might read as "buy a horizontal copilot and move on." The same research rules that out.

MIT describes the winners concretely. General-purpose chatbots spread because they're easy to try and flexible, and they fail in critical workflows for the same reason: no memory, no customization. The systems that cross the divide embed in a specific workflow, adapt to the operator's context, learn from feedback, and start from a narrow, high-value foothold before expanding into the core process. MIT also documents a "shadow AI" economy: employees getting more from tools they pick themselves than from sanctioned rollouts. That's a signal about adoption, not architecture. Those tools win at the individual task and still stall at workflow scale — the line the 5% cross and the horizontal seat doesn't.

That profile describes a vertical agentic startup built around a real workflow, with a design partner inside it. A horizontal SaaS seat doesn't deliver that, and neither does an internal team scoping against the 10%.

So the research goes past "partner instead of build" to the kind of partner that works. MIT's number is the cleanest version of the case: systems built through partnerships succeeded about twice as often as internal builds. A company co-founded around your workflow, with you as its design partner, is the most committed form of that partnership — the same structure the 2x came from, without the brittle, under-scoped build you would own alone. That description matches the companies Stackpoint builds.

5. Build, buy, or partner

The Vibe Coding Trap took up a specific case: an operator with a real pain point and no off-the-shelf product that already solves the workflow. That scope matters. Where a mature tool exists and fits, buying it is usually the right move, and most healthy AI adoption looks exactly like that. The build/buy/partner question is about the workflows where no good solution exists yet — which, in complex, high-barrier industries, is most of them.

For that case, the Trap laid out three paths: build internally, hire a dev shop, or partner with a venture studio building the company that solves it. It argued the third path on structural grounds — alignment, the AI trap, the embedded advantage, talent. The market data is a second line of support for the same conclusion.

Building internally is the path the data flags most clearly: the one-third-success-rate path in MIT's numbers. The failure modes (brittle integration, no learning loop, no audit trail, runaway token cost) sit in the buckets internal teams under-scope, because the prototype never needed them.

Hiring a dev shop clears the talent and execution bar but not the structural one. A spec frozen at contract close can't absorb the continuous change in workflow, model layer, and regulation that these reports identify as the actual challenge. The dev shop leaves; the trust tax, the eval loop, and the model volatility stay.

Partnering with the startup building for your market matches the profile of the 5%: a team whose equity value depends on the product working in your workflow for years, embedding alongside you while the architecture is still being set.

The structural logic and the market data point the same way.

6. Bottom line

The 95% number is well known. The two findings underneath it matter more: the failures sit outside the model, and the systems that work are bought or co-built with a partner embedded in the workflow, not assembled internally against a spec that only covered the easy 10%.

The Vibe Coding Trap reached this from first principles. The market data reaches it from the other direction, consistently across sources. If you're putting workflow, money, customers, or compliance exposure on an AI system, the gap between a prototype and dependable software is the thing to plan around. The reports locate that gap in integration, governance, and cost — and teams tend to hit it once the business already depends on what they built.

The cheapest build is still the one you don't have to rebuild. First principles and the market data agree on that much.

Stackpoint is a venture studio that co-founds and funds agentic AI companies in complex, high-barrier industries. We launch and invest in four companies per year, alongside operator design partners who help shape the product from day one.
If you have a workflow that feels like it should be solvable with AI — and the 95% number is the reason you've hesitated — we'd love to talk.

hello@stackpoint.com

Sources

MIT NANDA — The GenAI Divide: State of AI in Business 2025 (Challapally, Pease, Raskar, Chari; July 2025). Primary report: full text PDF · Coverage with lead-author interview: Fortune
On the contested 95% figure — methodology critique (Paul Roetzer): Marketing AI Institute. Wharton's Kevin Werbach raised the same concern on LinkedIn (post is behind LinkedIn's login wall).
CIO — "2026: The year AI ROI gets real" (Jan 13, 2026): CIO
Forrester — "The State of Agentic AI in 2026: Companies Are Chasing, Few Are Catching" (June 2026): Forrester · Coverage: ITPro
TechRadar — "Why most AI programs stall, and what it will take to scale them" (June 12, 2026): TechRadar
Asana, via ITPro — "IT leaders are being stung by 'unexpected' AI costs" (June 2026; 82% of UK IT leaders): ITPro
ITPro — "The AI pricing time bomb" (June 12, 2026; Uber and token-cost visibility): ITPro
G-P (Globalization Partners) — 2026 AI at Work Report (May 2026; nearly 70% of executives prepared to cut AI budgets if goals aren't met, 73% report AI ROI fell short): G-P press release
Teneo — Vision 2026 CEO & Investor Outlook Survey (Dec 2025; 53% of investors expect AI ROI in ≤6 months vs. 16% of large-cap CEOs): Teneo press release · Coverage: Axios
Gartner — "30% of Generative AI Projects Will Be Abandoned After Proof of Concept by End of 2025" (July 2024 forecast): Gartner newsroom · Revised figure (Gartner now reports at least 50% abandoned by end of 2025): Gartner article
Computing — "MIT report: 95% of corporate generative AI pilots fail to deliver returns" (Aug 2025): Computing