Evaluating the robustness and readiness of large frontier models in h…

This study, which evaluated the robustness and readiness of large frontier models in health AI applications, used publicly available, credentialed-access, third-party restricted and proprietary datasets. The minimum dataset necessary to interpret, verify and extend the analyses, excluding third-party copyrighted materials and privacy-restricted clinical data, is available in the accompanying repository and in the archived Zenodo release: https://doi.org/10.5281/zenodo.20047288. The public datasets used in the main benchmark or rubric-based evaluations are VQA-RAD, OmniMedVQA, PMC-VQA, PathVQA, SLAKE and the Health and Medicine subjects of MMMU. MIMIC-CXR-VQA was obtained through PhysioNet under credentialed access. JAMA Clinical Challenge and NEJM Image Challenge materials are subject to third-party copyright restrictions and are not redistributed. The visual substitution stress test set is derived from NEJM Image Challenge cases and is therefore subject to the same restrictions. The PX-60 evaluation set was derived from institution-specific clinical cases and is not publicly released because of patient privacy and institutional policy restrictions.

All custom code used for the evaluation pipeline, adversarial stress tests, rubric-based analyses, statistical analyses and figure generation is available under the MIT License in the accompanying repository.