⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
This Latent Space newsletter covers OpenAI's decision to stop relying on SWE-Bench Verified as a benchmark for AI coding models, citing saturation and contamination. It features a discussion with Mia Glaese and Olivia Watkins from OpenAI's Frontier Evals team, who explain the reasoning behind the move and endorse SWE-Bench Pro as a more suitable alternative. The conversation also explores the future of coding evaluations, emphasizing the need for more complex, real-world tasks and human-intensive evaluation methods.
- Benchmark Contamination: Frontier models have been exposed to SWE-Bench problems during training, leading them to regurgitate solutions verbatim (see the sketch after this list).
- Flawed Tests: Over 60% of the remaining problems in SWE-Bench Verified are deemed unsolvable due to overly narrow or overly broad test specifications.
- Endorsement of SWE-Bench Pro: OpenAI is officially moving away from SWE-Bench Verified and recommending SWE-Bench Pro as a more challenging and less contaminated benchmark.
- Future of Coding Evals: The focus is shifting toward longer-term tasks, open-ended design decisions, code quality, real-world product building, and human-intensive evaluations.
- Preparedness Framework: OpenAI's work on coding evals is tied to its Preparedness Framework, which aims to track and mitigate potential risks associated with advanced AI capabilities.
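
To make the contamination point concrete, here is a minimal, hypothetical sketch of how verbatim regurgitation might be flagged: compare a model-generated patch against the benchmark's gold patch after light normalization and treat near-identical output as a memorization signal. The function name, threshold, and normalization are illustrative assumptions, not OpenAI's or SWE-Bench's actual methodology.

```python
import difflib

def looks_regurgitated(model_patch: str, gold_patch: str, threshold: float = 0.95) -> bool:
    """Hypothetical contamination check: flag a model-generated patch as likely
    memorized if it is nearly identical to the benchmark's gold patch after
    whitespace normalization. The 0.95 cutoff is an illustrative assumption."""
    def normalize(s: str) -> str:
        return "\n".join(line.strip() for line in s.strip().splitlines())
    ratio = difflib.SequenceMatcher(None, normalize(model_patch), normalize(gold_patch)).ratio()
    return ratio >= threshold

# Example: a patch copied verbatim (modulo indentation) trips the check.
gold = "def add(a, b):\n    return a + b\n"
model = "def add(a, b):\n        return a + b\n"
print(looks_regurgitated(model, gold))  # True
```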