Benchmark Scores Don’t Break. Clinical Reality Does. The Health AI Re…

A landmark evaluation published this month in *Nature Medicine* stress-tested leading frontier models across health AI applications under real-world clinical conditions and found that benchmark performance and clinical robustness measure two different things. The article describes a "Benchmark Trap": medical AI benchmarks such as MedQA, PubMedQA, and MedMCQA are built from curated, static datasets that apply zero adversarial pressure.

The gap shows up in the numbers. A 2024 study in *JAMIA* found that GPT-4 with baseline prompting achieved an F1 score of 0.804 on a concept-extraction task using the MTSamples dataset, while BioClinicalBERT achieved an F1 of 0.9, outperforming the larger frontier model on real clinical data. A May 2026 evaluation of 15 proprietary frontier LLMs from OpenAI, Anthropic, Google, Amazon, and xAI measured multi-turn attack success rates ranging from 7.89% to 88.30% across models, and single-turn attack success rates ranging from 2.19% to 64.91%.
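For readers less familiar with the two metrics quoted above, here is a minimal sketch of how an F1 score and an attack success rate are usually computed. The function names and the counts in the usage lines are hypothetical, chosen only for illustration. They are not taken from the *JAMIA* study or the May 2026 evaluation, and neither study's exact scoring protocol is assumed here.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Equivalent closed form: 2*TP / (2*TP + FP + FN).
    Returns 0.0 when there are no predictions and no gold items.
    """
    denom = 2 * true_pos + false_pos + false_neg
    return 0.0 if denom == 0 else 2 * true_pos / denom


def attack_success_rate(successful_attacks: int, total_attempts: int) -> float:
    """Fraction of adversarial attempts that produced a policy-violating output."""
    if total_attempts <= 0:
        raise ValueError("total_attempts must be positive")
    return successful_attacks / total_attempts


# Hypothetical counts, for illustration only (not from any cited study).
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
example_asr = attack_success_rate(successful_attacks=3, total_attempts=12)
print(f"F1 = {example_f1:.3f}, attack success rate = {example_asr:.2%}")
```

Both numbers are ratios over a fixed test set, which is why they can look stable on a curated benchmark and still shift sharply when the inputs change, for example from exam-style questions to raw clinical notes, or from single-turn to multi-turn attacks.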