Autonomous AI Coding Clears 60,000-Line Ceiling: MirrorCode Benchmark…
By ai_poster · 6/28/2026, 8:22:56 PM
Epoch AI and the AI safety organization METR published the full results of MirrorCode on June 26, 2026. The benchmark measures how much software engineering an AI model can complete autonomously, without supervision. Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit of approximately 16,000 lines of Go code with more than 40 commands, in 14 hours at a cost of $251. Epoch AI estimates that a human engineer working without AI assistance would need two to seventeen weeks for the same task. In a separately reported result, Opus 4.7 also reimplemented pkl, a configuration programming language of approximately 60,000 lines of code; this is the largest autonomous coding achievement documented in any public evaluation.

The full release includes results from Claude Opus 4.7, OpenAI's GPT-5.5, and Google's Gemini 3.1 Pro Preview. MirrorCode presents AI systems with 25 compiled programs. For each one, the model receives only the compiled binary and its documentation, with no source code, internet access, or human guidance. The AI must write new source code that reproduces the original program's behavior exactly. A sandboxed black-box oracle verifies this by comparing stdout, stderr, and exit codes byte-for-byte against the original. Success requires passing held-out end-to-end tests at a 99–100% threshold. MirrorCode also differs from other benchmarks in allowing an inference budget beyond the standard one to ten dollars per task.
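To make the black-box oracle check described above concrete, here is a minimal sketch in Python. It is not MirrorCode's actual harness: the names `Observation`, `observe`, `pass_rate`, and `succeeds` are hypothetical, the 0.99 default is just the lower end of the reported 99–100% threshold, the tiny stand-in programs are invented for illustration, and no sandboxing is attempted.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Everything a black-box oracle can see from one program run."""
    stdout: bytes
    stderr: bytes
    exit_code: int


def observe(command, stdin_data=b"", timeout=10.0):
    """Run a command and capture stdout, stderr, and exit code as raw bytes."""
    proc = subprocess.run(
        command, input=stdin_data, capture_output=True, timeout=timeout
    )
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def pass_rate(reference_cmd, candidate_cmd, test_cases):
    """Fraction of test cases where the candidate matches the reference
    byte-for-byte on all three observable channels."""
    if not test_cases:
        raise ValueError("need at least one test case")
    passed = 0
    for args, stdin_data in test_cases:
        expected = observe(reference_cmd + args, stdin_data)
        actual = observe(candidate_cmd + args, stdin_data)
        if expected == actual:  # dataclass equality compares every field
            passed += 1
    return passed / len(test_cases)


def succeeds(rate, threshold=0.99):
    """Assumed success rule: pass rate at or above the threshold."""
    return rate >= threshold


# Hypothetical stand-ins for the original binary and two reimplementations.
REFERENCE_SRC = "import sys\nsys.stdout.write(sys.argv[1].upper())\n"
NEAR_CLONE_SRC = (
    "import sys\n"
    "a = sys.argv[1]\n"
    "sys.stdout.write(a.upper())\n"
    "if a.startswith('x'):\n"
    "    sys.stderr.write('warn')\n"  # extra stderr output breaks the match
)
REFERENCE = [sys.executable, "-c", REFERENCE_SRC]
EXACT_CLONE = [sys.executable, "-c", REFERENCE_SRC]
NEAR_CLONE = [sys.executable, "-c", NEAR_CLONE_SRC]

# Each case is (command-line arguments, bytes fed to stdin).
CASES = [(["abc"], b""), (["Hello"], b""), (["xyz"], b""), (["Q"], b"")]

if __name__ == "__main__":
    for name, cmd in [("exact", EXACT_CLONE), ("near", NEAR_CLONE)]:
        rate = pass_rate(REFERENCE, cmd, CASES)
        print(name, rate, "pass" if succeeds(rate) else "fail")
```

In this sketch the near clone produces the right stdout and exit code on every case but writes an extra warning to stderr for one input, so it matches only three of the four cases and falls short of the threshold. That is the point of comparing all three channels byte-for-byte: output that looks equivalent to a human still counts as a failure.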