Frontier AI models are more powerful than ever, but new research suggests some of the hype around autonomous AI may be getting ahead of reality.
General Reasoning, an AI research firm, released KellyBench this week, a long-horizon test that places AI agents inside a simulated English Premier League betting market and asks them to grow a bankroll over a full season.
The results were not flattering.
Every Model Lost Money
Every model lost money. Claude did best, finishing down just 11%, but that was still a loss. Grok 4.20 fared worst, burning through nearly 90% of its bankroll. xAI, Elon Musk's company behind Grok, has experienced heavy leadership turnover and scaling challenges in its attempt to catch up with the leading models.
The firm rated each model on a 44-point sophistication rubric developed with quantitative betting experts.
No model scored more than a third of the available points. “Models struggle to behave coherently over long time horizons,” the researchers wrote, “often failing to act upon their analysis …


