AI Model Comparison
Speed, pricing, capabilities, and recommendations across AI model providers
Capability Scores by Use Case (0–100, higher is better)
| Model | Provider | Planning | Coding | Vision | Research | Creative | Average |
|---|---|---|---|---|---|---|---|
| GPT-5.4 Pro | OpenAI | 95 | 94 | 84 | 95 | 86 | 90.8 |
| GPT-5.4 | OpenAI | 93 | 93 | 84 | 93 | 85 | 89.6 |
| Gemini 3.1 Pro | Google Vertex | 93 | 88 | 81 | 94 | 90 | 89.2 |
| Claude Opus 4.6 | Anthropic | 92 | 88 | 74 | 91 | 95 | 88.0 |
| Kimi K2.5 | Moonshot AI | 87 | 88 | 90 | 92 | 80 | 87.4 |
| Qwen 3.5 397B | Alibaba Cloud (Qwen) | 88 | 88 | 85 | 90 | 85 | 87.2 |
| Gemini 3 Flash | Google AI Studio | 82 | 84 | 81 | 90 | 88 | 85.0 |
| Qwen 3.5 27B | Alibaba Cloud (Qwen) | 86 | 85 | 82 | 86 | 82 | 84.2 |
| GPT-5.3-Codex | OpenAI | 87 | 92 | 75 | 85 | 78 | 83.4 |
| Claude Sonnet 4.6 | Anthropic | 85 | 86 | 72 | 84 | 88 | 83.0 |
| GLM 5 | SiliconFlow | 83 | 84 | 64 | 87 | 80 | 79.6 |
| Gemini 3.1 Flash-Lite | Google AI Studio | 78 | 76 | 77 | 80 | 78 | 77.8 |
| MiniMax M2.5 | MiniMax | 90 | 91 | 35 | 89 | 80 | 77.0 |
| GLM 4.7 | Cerebras (Direct) | 76 | 80 | 58 | 82 | 74 | 74.0 |
| Claude Haiku 4.5 | Anthropic | 72 | 75 | 73 | 70 | 78 | 73.6 |
| GPT-5.3-Codex-Spark | OpenAI | 90 | 93 | 10 | 82 | 75 | 70.0 |
| Grok Code Fast 1 | xAI | 68 | 78 | 55 | 72 | 70 | 68.6 |
| Llama 3.1 8B | Taalas | 45 | 55 | 10 | 42 | 40 | 38.4 |
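For transparency, the Average column is just the unweighted mean of the five use-case scores. A minimal sketch of the calculation, using three rows from the table above:

```python
# Recompute the Average column as the unweighted mean of the five
# use-case scores: Planning, Coding, Vision, Research, Creative.
scores = {
    "GPT-5.4 Pro":     [95, 94, 84, 95, 86],
    "Claude Opus 4.6": [92, 88, 74, 91, 95],
    "Llama 3.1 8B":    [45, 55, 10, 42, 40],
}

for model, s in scores.items():
    print(f"{model}: {sum(s) / len(s):.1f}")  # 90.8, 88.0, 38.4
```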
News & Updates
Latest developments in AI model performance and infrastructure
GPT-5.4 — OpenAI's 1M Context Unified Model Replaces Codex Line
OpenAI has released GPT-5.4, its most ambitious model consolidation yet. It merges the previously separate Codex coding line, reasoning capabilities, and general knowledge into a single model, and adds native computer use, a first for OpenAI's mainline models. The headline feature is a 1,050,000-token context window, but there's a catch: requests beyond 272K input tokens are billed at a higher tier ($5/M input and $22.50/M output, versus the standard $2.50/$15), and long-context performance degrades significantly.

While GPT-4.1 scored 100% on needle-in-a-haystack at 1M tokens, real-world agentic tasks show diminishing returns as context grows: models lose track of earlier instructions, hallucinate references, and exhibit attention drift. OpenAI acknowledges this by training GPT-5.4 with "compaction" to compress trajectories, but independent evaluations are still pending. For most use cases, the sweet spot remains under 256K tokens.

The biggest wins are in agentic benchmarks: OSWorld jumps to 75% (surpassing the 72.4% human baseline), GDPval hits 83% across 44 professions, and ARC-AGI-2 reaches 73.3%. On coding, it matches GPT-5.3-Codex on SWE-Bench Pro (57.7% vs 56.8%) while adding much stronger general knowledge. GPT-5.2 Thinking is scheduled for deprecation on June 5, 2026, with GPT-5.4 positioned as the successor. Codex continues to run on the GPT-5.4 family, and OpenAI's priority tier keeps the higher-throughput path at premium pricing, but the public baseline remains the standard GPT-5.4 API at roughly 78 tok/s and $2.50/$15. That leaves it materially faster than Claude Opus 4.6 while staying close to Gemini 3.1 Pro on price.
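A minimal sketch of what the two tiers imply for request cost, using the rates quoted above. How OpenAI applies the 272K cut-over (per-token tiering versus repricing the whole request) isn't specified here, so the tiered treatment below is an assumption:

```python
# Hypothetical cost model for GPT-5.4's two pricing tiers, using the
# figures quoted above. Billing at the 272K cut-over is an assumption:
# here, input tokens past the threshold get the long-context rate, and
# output is billed at the request's tier.
LONG_CONTEXT_THRESHOLD = 272_000

STANDARD = {"input": 2.50, "output": 15.00}      # $ per 1M tokens
LONG_CONTEXT = {"input": 5.00, "output": 22.50}  # beyond 272K input

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in dollars of one GPT-5.4 request."""
    base_in = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    extra_in = max(input_tokens - LONG_CONTEXT_THRESHOLD, 0)
    tier = LONG_CONTEXT if extra_in > 0 else STANDARD
    cost = base_in / 1e6 * STANDARD["input"]
    cost += extra_in / 1e6 * LONG_CONTEXT["input"]
    cost += output_tokens / 1e6 * tier["output"]
    return cost

print(f"${request_cost(272_000, 4_000):.2f}")  # $0.74 at the threshold
print(f"${request_cost(800_000, 4_000):.2f}")  # $3.41 for an 800K input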
Mercury 2 Brings Diffusion LLMs Back Into the Latency Race
Inception launched Mercury 2 as a faster and cheaper follow-up to the original Mercury line. Officially, the company says Mercury 2 reaches 1,009 tokens per second on Blackwell GPUs, with 128K context and per-million-token pricing of $0.25/$0.75. Independent tracking is more conservative but still unusually fast: Artificial Analysis' latest public snapshot puts Mercury 2 around 655 tok/s, which keeps it far ahead of mainstream frontier APIs on direct latency. The tradeoff is a lower capability ceiling, not raw speed. Mercury 2 is best read as a throughput-first model for short-loop agentic work, low-latency chat, and interactive coding assistance where response speed matters more than absolute benchmark leadership.
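To translate those throughput numbers into felt latency, a back-of-the-envelope model: total response time is roughly time-to-first-token plus decode time. The TTFT values below are illustrative assumptions, not measured figures; the 78 tok/s comparison point is the GPT-5.4 baseline quoted above.

```python
# Rough end-to-end latency: time-to-first-token (TTFT) plus decode
# time at a sustained throughput. TTFT values are illustrative
# assumptions, not measurements.
def response_seconds(tokens: int, tok_per_sec: float, ttft: float) -> float:
    return ttft + tokens / tok_per_sec

# A 500-token reply:
print(f"Mercury 2 @ 655 tok/s: {response_seconds(500, 655, ttft=0.3):.2f}s")  # ~1.06s
print(f"GPT-5.4   @ 78 tok/s:  {response_seconds(500, 78, ttft=0.6):.2f}s")   # ~7.01s
```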
Taalas HC1 Pushes Silicon Llama to ~17K tokens/sec
Taalas says its HC1 chip can run Silicon Llama 3.1 8B at about 17K tokens per second per user, far beyond GPU-class direct inference. The design hardwires the model into custom silicon instead of relying on HBM-heavy accelerator stacks, and Taalas claims roughly 10x lower power than conventional hardware. The public details remain striking: a TSMC 6nm process, an 815mm² die, 53B transistors, a 24-person team, and about $169M raised. The key caveat is quality: Taalas explicitly says the first-generation Silicon Llama is aggressively quantized, with mixed 3-bit and 6-bit weights, so it does not match full-precision GPU baselines. Even with that caveat, the speed headroom is unusual enough to make near-zero-latency chat, instant summarization, and multi-step agent loops practical in ways general cloud inference still struggles to match.
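The quantization tradeoff is easiest to see in the weight footprint. A quick sketch comparing fp16 to a mixed 3-/6-bit scheme for an 8B-parameter model; the 50/50 split between 3-bit and 6-bit weights is our assumption, since Taalas hasn't published the exact mix:

```python
# Weight-memory footprint of an 8B-parameter model at different
# precisions. The 50/50 split between 3-bit and 6-bit weights is an
# assumption; Taalas has not published the exact mix.
PARAMS = 8e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

fp16 = weight_gb(16)                   # 16.0 GB
mixed = weight_gb(0.5 * 3 + 0.5 * 6)   # 4.5 bits/weight on average: 4.5 GB

print(f"fp16:      {fp16:.1f} GB")
print(f"mixed 3/6: {mixed:.1f} GB ({fp16 / mixed:.1f}x smaller)")
```

That roughly 3.6x compression is part of what makes baking the full model into a single 815mm² die plausible at all.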
Help us keep this accurate
Found a wrong price, missing model, or outdated benchmark? Open an issue or send a pull request — every fix helps the community.
