Forget standardized tests and coding benchmarks. A new simulation called FoodTruck Bench just put AI models through something far more revealing: running an actual business.
The setup is simple but brutal. Each AI gets $2,000 in starting capital and a virtual food truck. Over 30 simulated days, they make real business decisions: where to park, what menu items to offer, how to price them, when to hire staff, and how to manage inventory. Same scenario for every model. Same 34 tools at their disposal.
The results? Twelve models tested. Eight went bankrupt.
The Winners and Losers
Claude Opus 4.6 dominated, generating $49,000 in revenue. GPT-5.2 came in second at $28,000. That gap is significant: Opus brought in roughly 75% more revenue than the runner-up.
But the most interesting finding isn't who won. It's why two-thirds of the models failed entirely.
Every Model That Took a Loan Went Bankrupt
This is the headline that should concern anyone using AI for financial decisions. Of the 12 models tested, 8 took out loans. All 8 of them went bankrupt. That's a 100% failure rate for leveraged AI decisions.
Why? The benchmark creator explained on Reddit that models consistently overestimated their ability to service debt while underestimating operational volatility. They'd take loans to expand before establishing reliable cash flow. Classic over-leveraging, the same mistake that kills real businesses.
The models that succeeded (Opus, GPT-5.2, and two others) shared a common trait: conservative capital management. They grew organically, reinvested profits, and avoided debt entirely.
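The over-leveraging failure mode is easy to reproduce in a toy model. The sketch below is not FoodTruck Bench's actual economics; every number in it (starting capital aside, which matches the benchmark's $2,000) is an illustrative assumption. It simulates 30 days of volatile profit, with and without a fixed daily debt payment, and shows why a business that looks solvent on average can still go bankrupt once debt service is due regardless of how the day went.

```python
import random

def simulate(days=30, capital=2000.0, loan=0.0, daily_payment=0.0, seed=0):
    """Toy cash-flow model: noisy daily profit plus optional fixed debt service.

    All parameters are illustrative assumptions, not the benchmark's real rules.
    Returns (bankruptcy_day, final_cash); bankruptcy_day is None if the
    business survives.
    """
    rng = random.Random(seed)
    cash = capital + loan
    for day in range(days):
        # Daily profit averages positive but swings widely (operational volatility).
        cash += rng.gauss(80, 150)
        # Debt service is owed on bad days too.
        cash -= daily_payment
        if cash < 0:
            return day + 1, cash  # bankrupt on this day
    return None, cash

def bankruptcy_rate(loan, daily_payment, runs=1000):
    """Fraction of simulated runs that end in bankruptcy."""
    failures = sum(
        1 for s in range(runs)
        if simulate(loan=loan, daily_payment=daily_payment, seed=s)[0] is not None
    )
    return failures / runs

print("no loan  :", bankruptcy_rate(0, 0))
print("leveraged:", bankruptcy_rate(2000, 200))
```

With these assumed numbers, the unleveraged business essentially never fails, while the leveraged one fails in a substantial fraction of runs even though its expected profit is positive. That is the mistake the bankrupt models made: they priced in the average day, not the variance.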
Gemini's Infinite Loop Problem
In a quirk that reveals something about model architecture, Gemini 3 Flash Thinking was the only model tested that got stuck in an infinite decision loop. Not occasionally: in 100% of its runs.
The model would freeze at decision points, unable to commit to a course of action. According to the benchmark writeup, this happened specifically when facing competing priorities with no clear optimal choice. Real business decisions, in other words.
Humans Still Crush AI at Business
One day after the benchmark launched, a human player hit $101,685, just 0.6% below the theoretical maximum. That's more than double what the best AI achieved.
The player reportedly took 9 runs on the same seed over about 10 hours. But here's the kicker: on a completely random seed with no prior knowledge, they still scored $91,000, nearly double Opus's total.
What does this tell us? AI models can make reasonable business decisions. They can avoid obvious mistakes. But they lack the pattern recognition and adaptive reasoning that experienced humans bring to complex, multi-variable problems.
What This Means for Your AI Strategy
If you're using AI for business decisions, this benchmark offers some practical guidance:
- Don't let AI manage leverage. The 100% bankruptcy rate for loan-taking models is a red flag. Use AI for analysis and recommendations, but keep human oversight on financial commitments.
- Test under real conditions. Standardized benchmarks don't predict real-world performance. The gap between MMLU scores and business simulation results is massive.
- Favor conservative models. Opus and GPT-5.2 succeeded by being cautious. If your AI tool is aggressive in its recommendations, that might not be a feature.
- Decision paralysis is real. Some models freeze when facing ambiguous choices. If you're using AI for time-sensitive decisions, test for this specifically.
The Bigger Picture
FoodTruck Bench is clever because it measures something benchmarks usually ignore: judgment. Not knowledge, not reasoning speed, but the ability to make good decisions with imperfect information over time.
That's what business actually is. And right now, the best AI models are about half as good as a determined human at doing it.
The benchmark is open and playable if you want to test yourself against the models. Given that a human hit near-perfect scores in 10 hours, it's a good reminder that AI tools are exactly that: tools. The human running them still matters more.

