OpenAI announces GeneBench-Pro, a benchmark test to measure the scien…

OpenAI has released GeneBench-Pro, a computational biology benchmark that measures whether an AI can analyze data and judge whether what it finds is a significant signal or just noise. OpenAI points out that scientific data rarely comes with instructions, and that in real research an AI must make sophisticated judgments that go beyond recalling facts or following predefined workflows. GeneBench-Pro is designed to measure higher-order abilities such as correcting assumptions, handling ambiguity, and selecting appropriate analytical pathways. Each problem comes with realistic but disorganized datasets and experimental background, requiring the AI to explore the data, select methods, and work iteratively.

In testing, OpenAI's highest-performing model, 'GPT-5.6 Sol,' achieved a pass rate of 28.7% at the highest inference setting and 31.5% at the top-tier 'Pro' setting. At the lowest inference level, the pass rate for GPT-5.6 Sol remained in the single digits; at the highest inference level it was roughly six times higher. When GeneBench development began, the then-latest model, 'GPT-5,' scored less than 5%, and OpenAI states that 'if the current pace of progress continues, this benchmark could saturate by the end of the year.'

Human reviewers estimated that it would take a human expert approximately 20 to 40 hours to solve one GeneBench-Pro problem.
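The pass rates quoted above can be cross-checked with some quick arithmetic. The sketch below is only a rough illustration: it takes the article's "approximately six times" as an exact factor of 6, which is an assumption, so the lowest-setting figure it produces is an estimate and not a number reported by OpenAI. The variable and function names are made up for this example.

```python
# Figures quoted in the article (pass rates in percent for GPT-5.6 Sol).
HIGHEST_SETTING_PASS_RATE = 28.7
PRO_SETTING_PASS_RATE = 31.5

# Assumption: the article says "approximately six times"; we treat it as exactly 6.
APPROX_GAIN_FACTOR = 6.0


def implied_lowest_setting_rate(highest: float, factor: float) -> float:
    """Back out the lowest-setting pass rate implied by the stated gain factor."""
    return highest / factor


def percentage_point_gap(higher: float, lower: float) -> float:
    """Difference between two pass rates, in percentage points."""
    return higher - lower


lowest_estimate = implied_lowest_setting_rate(
    HIGHEST_SETTING_PASS_RATE, APPROX_GAIN_FACTOR
)
pro_gap = percentage_point_gap(PRO_SETTING_PASS_RATE, HIGHEST_SETTING_PASS_RATE)

print(f"Implied lowest-setting pass rate: about {lowest_estimate:.1f}%")
print(f"Gain from the 'Pro' setting: about {pro_gap:.1f} percentage points")
```

Under that assumption, the implied lowest-setting pass rate falls below 10%, which is consistent with the "single digits" description in the article.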