SWE-bench Verified
OTIS Mock AIME 2024-2025
GPQA Diamond
FrontierMath Private
Source: Epoch AI Benchmarking Hub