AI Model Comparison
Speed, pricing, capabilities, and recommendations across AI model providers
Capability Scores by Use Case (0–100, higher is better)
| Model | Provider | Planning | Coding | Vision | Research | Creative | Average |
|---|---|---|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | 95 | 94 | 84 | 95 | 86 | 90.8 |
| GPT-5.4 | OpenAI | 93 | 93 | 84 | 93 | 85 | 89.6 |
| Gemini 3.1 Pro | Google Vertex | 93 | 88 | 81 | 94 | 90 | 89.2 |
| Claude Opus 4.6 | Anthropic | 92 | 88 | 74 | 91 | 95 | 88.0 |
| Kimi K2.5 | Moonshot AI | 87 | 88 | 90 | 92 | 80 | 87.4 |
| Qwen 3.5 397B | Alibaba Cloud (Qwen) | 88 | 88 | 85 | 90 | 85 | 87.2 |
| Gemini 3 Flash | Google AI Studio | 82 | 84 | 81 | 90 | 88 | 85.0 |
| Qwen 3.5 27B | Alibaba Cloud (Qwen) | 86 | 85 | 82 | 86 | 82 | 84.2 |
| GPT-5.3-Codex | OpenAI | 87 | 92 | 75 | 85 | 78 | 83.4 |
| Claude Sonnet 4.6 | Anthropic | 85 | 86 | 72 | 84 | 88 | 83.0 |
| GLM 5 | SiliconFlow | 83 | 84 | 64 | 87 | 80 | 79.6 |
| Gemini 3.1 Flash-Lite | Google AI Studio | 78 | 76 | 77 | 80 | 78 | 77.8 |
| MiniMax M2.5 | MiniMax | 90 | 91 | 35 | 89 | 80 | 77.0 |
| GLM 4.7 | Cerebras (Direct) | 76 | 80 | 58 | 82 | 74 | 74.0 |
| Claude Haiku 4.5 | Anthropic | 72 | 75 | 73 | 70 | 78 | 73.6 |
| GPT-5.3-Codex-Spark | OpenAI | 90 | 93 | 10 | 82 | 75 | 70.0 |
| Grok Code Fast 1 | xAI | 68 | 78 | 55 | 72 | 70 | 68.6 |
| Llama 3.1 8B | Taalas | 45 | 55 | 10 | 42 | 40 | 38.4 |
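For transparency, the Average column is just the unweighted mean of the five use-case scores. A minimal sketch of the calculation, using three rows from the table above:

```python
# Recompute the Average column as the unweighted mean of the five
# use-case scores: Planning, Coding, Vision, Research, Creative.
scores = {
    "GPT-5.4 Pro":     [95, 94, 84, 95, 86],
    "Claude Opus 4.6": [92, 88, 74, 91, 95],
    "Llama 3.1 8B":    [45, 55, 10, 42, 40],
}

for model, s in scores.items():
    print(f"{model}: {sum(s) / len(s):.1f}")  # 90.8, 88.0, 38.4
```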
News & Updates
Latest developments in AI model performance and infrastructure
GPT-5.4 — OpenAI's 1M Context Unified Model Replaces Codex Line
OpenAI has released GPT-5.4, its most ambitious model consolidation yet. It merges the previously separate Codex coding line, reasoning capabilities, and general knowledge into a single model, and adds native computer use, a first for OpenAI's mainline models. The headline feature is a 1,050,000-token context window, but there's a catch: requests beyond 272K input tokens are billed at a higher tier ($5/M input and $22.50/M output, versus the standard $2.50/$15), and long-context performance degrades significantly.

While GPT-4.1 scored 100% on needle-in-a-haystack at 1M tokens, real-world agentic tasks show diminishing returns as context grows: models lose track of earlier instructions, hallucinate references, and exhibit attention drift. OpenAI acknowledges this by training GPT-5.4 with "compaction" to compress trajectories, but independent evaluations are still pending. For most use cases, the sweet spot remains under 256K tokens.

The biggest wins are in agentic benchmarks: OSWorld jumps to 75% (surpassing the 72.4% human baseline), GDPval hits 83% across 44 professions, and ARC-AGI-2 reaches 73.3%. On coding, it matches GPT-5.3-Codex on SWE-Bench Pro (57.7% vs 56.8%) while adding much stronger general knowledge. GPT-5.2 Thinking is scheduled for deprecation on June 5, 2026, with GPT-5.4 positioned as the successor. Codex continues to run on the GPT-5.4 family, and OpenAI's priority tier keeps the higher-throughput path at premium pricing, but the public baseline remains the standard GPT-5.4 API at roughly 78 tok/s and $2.50/$15. That leaves it materially faster than Claude Opus 4.6 while staying close to Gemini 3.1 Pro on price.
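A minimal sketch of what the two tiers imply for request cost, using the rates quoted above. How OpenAI applies the 272K cut-over (per-token tiering versus repricing the whole request) isn't specified here, so the tiered treatment below is an assumption:

```python
# Hypothetical cost model for GPT-5.4's two pricing tiers, using the
# figures quoted above. Billing at the 272K cut-over is an assumption:
# here, input tokens past the threshold get the long-context rate, and
# output is billed at the request's tier.
LONG_CONTEXT_THRESHOLD = 272_000

STANDARD = {"input": 2.50, "output": 15.00}      # $ per 1M tokens
LONG_CONTEXT = {"input": 5.00, "output": 22.50}  # beyond 272K input

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in dollars of one GPT-5.4 request."""
    base_in = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    extra_in = max(input_tokens - LONG_CONTEXT_THRESHOLD, 0)
    tier = LONG_CONTEXT if extra_in > 0 else STANDARD
    cost = base_in / 1e6 * STANDARD["input"]
    cost += extra_in / 1e6 * LONG_CONTEXT["input"]
    cost += output_tokens / 1e6 * tier["output"]
    return cost

print(f"${request_cost(272_000, 4_000):.2f}")  # $0.74 at the threshold
print(f"${request_cost(800_000, 4_000):.2f}")  # $3.41 for an 800K input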
Mercury 2 Brings Diffusion LLMs Back Into the Latency Race
Inception launched Mercury 2 as a faster and cheaper follow-up to the original Mercury line. Officially, the company says Mercury 2 reaches 1,009 tokens per second on Blackwell GPUs, with 128K context and per-million-token pricing of $0.25/$0.75. Independent tracking is more conservative but still unusually fast: Artificial Analysis' latest public snapshot puts Mercury 2 around 655 tok/s, which keeps it far ahead of mainstream frontier APIs on direct latency. The tradeoff is a lower capability ceiling, not raw speed. Mercury 2 is best read as a throughput-first model for short-loop agentic work, low-latency chat, and interactive coding assistance where response speed matters more than absolute benchmark leadership.
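To translate those throughput numbers into felt latency, a back-of-the-envelope model: total response time is roughly time-to-first-token plus decode time. The TTFT values below are illustrative assumptions, not measured figures; the 78 tok/s comparison point is the GPT-5.4 baseline quoted above.

```python
# Rough end-to-end latency: time-to-first-token (TTFT) plus decode
# time at a sustained throughput. TTFT values are illustrative
# assumptions, not measurements.
def response_seconds(tokens: int, tok_per_sec: float, ttft: float) -> float:
    return ttft + tokens / tok_per_sec

# A 500-token reply:
print(f"Mercury 2 @ 655 tok/s: {response_seconds(500, 655, ttft=0.3):.2f}s")  # ~1.06s
print(f"GPT-5.4   @ 78 tok/s:  {response_seconds(500, 78, ttft=0.6):.2f}s")   # ~7.01s
```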
Taalas HC1 Pushes Silicon Llama to ~17K tokens/sec
Taalas says its HC1 chip can run Silicon Llama 3.1 8B at about 17K tokens per second per user, far beyond GPU-class direct inference. The design hardwires the model into custom silicon instead of relying on HBM-heavy accelerator stacks, and Taalas claims roughly 10x lower power than conventional hardware. The public details remain striking: a TSMC 6nm process, an 815mm² die, 53B transistors, a 24-person team, and about $169M raised. The key caveat is quality: Taalas explicitly says the first-generation Silicon Llama is aggressively quantized, with mixed 3-bit and 6-bit weights, so it does not match full-precision GPU baselines. Even with that caveat, the speed headroom is unusual enough to make near-zero-latency chat, instant summarization, and multi-step agent loops practical in ways general cloud inference still struggles to match.
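The quantization tradeoff is easiest to see in the weight footprint. A quick sketch comparing fp16 to a mixed 3-/6-bit scheme for an 8B-parameter model; the 50/50 split between 3-bit and 6-bit weights is our assumption, since Taalas hasn't published the exact mix:

```python
# Weight-memory footprint of an 8B-parameter model at different
# precisions. The 50/50 split between 3-bit and 6-bit weights is an
# assumption; Taalas has not published the exact mix.
PARAMS = 8e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

fp16 = weight_gb(16)                   # 16.0 GB
mixed = weight_gb(0.5 * 3 + 0.5 * 6)   # 4.5 bits/weight on average: 4.5 GB

print(f"fp16:      {fp16:.1f} GB")
print(f"mixed 3/6: {mixed:.1f} GB ({fp16 / mixed:.1f}x smaller)")
```

That roughly 3.6x compression is part of what makes baking the full model into a single 815mm² die plausible at all.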
Help us keep this accurate
Found a wrong price, missing model, or outdated benchmark? Open an issue or send a pull request — every fix helps the community.
