Tag: benchmarks
All the articles with the tag "benchmarks".
-
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
DeepSeekMath-V2 scored 118/120 on Putnam 2024, surpassing the best human score of 90, and reached IMO gold-medal level. The key innovation is a self-verification architecture in which models learn to identify and fix errors in their own proofs.
-
GPT-5.1 Codex-Max: Altman’s Card After Gemini 3.0
OpenAI’s GPT-5.1 Codex-Max landed right after Gemini 3.0, topping SWE-bench while cutting costs and reframing what an AI coding assistant can be for working developers.