Most AI Models Would Run Your Company Into the Ground, Princeton's CE…

A new benchmark from Princeton University's Z-Lab, CEO-Bench, placed 14 AI systems in full operational control of a simulated SaaS company that began with $1 million in seed capital and zero customers, and ran each for 500 simulated days. Only three language models grew the starting capital: Claude Fable 5 finished with about $47.15 million, Claude Opus 4.8 with roughly $27.8 million, and GPT-5.5 with about $21.3 million. Five models went bankrupt before the simulation ended: GLM 5.1, Claude Haiku 4.5, Gemini 3 Flash, DeepSeek V4 Pro, and Grok 4.20. The fourth profitable contestant was a hardcoded rule-based algorithm with no language model involved; it finished with about $15.76 million, ahead of every LLM outside the top three. The environment included 34 tools and 19 database tables, and the decisions mirrored real executive work, such as setting pricing tiers and allocating R&D budgets, with feedback on those decisions arriving only after a delay.