Tag: benchmarks

All the articles with the tag "benchmarks".

GPT‑5.2-Codex: A Restrained Upgrade With Clearer Priorities

19 Dec, 2025

OpenAI’s GPT‑5.2-Codex hits state-of-the-art results on SWE‑Bench Pro and Terminal‑Bench 2.0, improves long-horizon refactors and Windows workflows, and leans hard into defensive cybersecurity.

The Gap Is Widening: What DeepSeek-V3.2 Tells Us About Two AI Futures

4 Dec, 2025

DeepSeek's latest paper admits the gap with closed-source models is growing. But the real story is how chip export controls forced two diverging AI paradigms, and what that means for the endgame.

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

30 Nov, 2025

DeepSeekMath-V2 scored 118/120 on Putnam 2024, surpassing the human record of 90, and achieved IMO gold level. The key innovation: a self-verification architecture where models learn to identify and fix their own proof errors.

GPT-5.1 Codex-Max: Altman’s Card After Gemini 3.0

20 Nov, 2025

OpenAI’s GPT-5.1 Codex-Max landed right after Gemini 3.0, topping SWE-bench while cutting costs and reframing what an AI coding assistant can be for working developers.
