MLPerf Inference v6.0 dropped today. The headline number: 2.7x.
Same GB300 NVL72 rack. Same power envelope. Same silicon. But 2.7x more tokens per second on DeepSeek-R1 in the server scenario, achieved purely through software optimization of TensorRT-LLM and the open-source Dynamo inference framework.
That’s a 60%+ reduction in the cost to produce each token, delivered to every operator already running Blackwell infrastructure, without buying a single new chip.
Disclosure: NVIDIA invited BEP Research to an embargoed pre-brief on the MLPerf v6.0 results ahead of release. During that briefing, NVIDIA confirmed that Google had not submitted Ironwood for this round. The analysis below is independent editorial work.
Why 2.7x Is the Number That Matters
Wall Street models NVIDIA as a hardware company: new chip, new revenue cycle. But MLPerf v6.0 just demonstrated something the sell-side hasn’t fully priced: NVIDIA’s software stack is a second revenue engine that compounds on top of the hardware cycle.
Six months ago, GB300 NVL72 debuted on MLPerf v5.1 with DeepSeek-R1 server throughput of 2,907 tokens/sec/GPU. Today, same hardware: 8,064 tokens/sec/GPU. That’s 2.77x.
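For readers who want to check the arithmetic, here is a quick back-of-envelope sketch. The two throughput figures are the ones quoted above; the cost calculation rests on my simplifying assumption that cost per GPU-hour is unchanged (same rack, same power), so cost per token scales as one over throughput.

```python
# Throughput quoted in the text: tokens/sec/GPU, DeepSeek-R1, server scenario.
V51_TOKENS_PER_SEC = 2907  # MLPerf v5.1
V60_TOKENS_PER_SEC = 8064  # MLPerf v6.0, same hardware


def speedup(before: float, after: float) -> float:
    """Throughput ratio between two rounds on the same hardware."""
    return after / before


def cost_per_token_reduction(speedup_factor: float) -> float:
    """Fractional drop in cost per token.

    Assumption (mine, not MLPerf's): cost per GPU-hour is fixed,
    so cost per token is proportional to 1 / throughput.
    """
    return 1.0 - 1.0 / speedup_factor


s = speedup(V51_TOKENS_PER_SEC, V60_TOKENS_PER_SEC)
r = cost_per_token_reduction(s)
print(f"speedup: {s:.2f}x")                   # speedup: 2.77x
print(f"cost per token reduction: {r:.1%}")   # cost per token reduction: 64.0%
```

That is where the "60%+" figure comes from: a 2.77x throughput gain on fixed costs works out to roughly a 64% lower cost per token.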
This is the pattern I’ve been tracking in The Memory Wall and the Co-Design Series: stack depth compounds over time. Last round it was 2.8x. This round, 2.7x on next-generation hardware. The playbook repeats.
What Actually Changed
The 2.7x came from coordinated optimization across the entire inference pipeline.
Faster kernels and kernel fusions. Fewer launches, tighter execution. This requires intimate knowledge of the hardware (warp scheduling, the memory hierarchy, the tensor core pipeline), and it is much harder without tight control over the hardware-software boundary.
Disaggregated serving via Dynamo. Separates prefill and decode so each phase can be optimized independently. Prefill is compute-bound and decode is memory-bound, so Dynamo routes each phase to GPUs configured for it.
Wide Expert Parallel and Multi-Token Prediction. WideEP shards MoE experts across GPUs to reduce weight-loading bottlenecks. MTP uses idle compute during memory-bound decode to predict additional tokens in parallel.
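To see why predicting extra tokens pays off, here is a rough sketch of the expected yield per decode step. It is not NVIDIA's implementation. It assumes the drafted tokens are verified in order, that each is accepted independently with the same probability, and that acceptance stops at the first rejection, which is the standard simplification used for speculative decoding. The function name and the 0.8 acceptance rate are hypothetical.

```python
def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per decode step.

    One token is always produced. Each of the `draft_tokens` extra
    predictions counts only if every earlier draft was accepted, so the
    i-th draft contributes accept_prob ** i.
    Assumption: independent, identical acceptance probability per draft.
    """
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


# Hypothetical acceptance rate of 0.8, varying how many tokens are drafted.
for k in (0, 1, 2, 3):
    print(k, round(expected_tokens_per_step(0.8, k), 3))
```

Because decode is memory-bound, the extra predictions ride on compute that would otherwise sit idle, so any yield above one token per step is close to free throughput. The gain flattens as more tokens are drafted, since later drafts are less likely to survive.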
KV-aware routing. Routes requests to GPUs that already hold relevant cached context — Azure benchmarks showed over 20x faster time to first token versus round-robin.
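A toy sketch makes the routing idea concrete. This is not Dynamo's router or its API: the `Worker` class, the function names, and the token IDs are all hypothetical, and a production router would also weigh load and cache eviction. The sketch only shows why sending a request to the GPU that already holds its prefix shrinks the prefill work that drives time to first token.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Worker:
    """Hypothetical GPU worker that remembers which prompts it has cached."""

    def __init__(self, name):
        self.name = name
        self.cached = []  # token sequences whose KV cache this worker holds

    def best_hit(self, tokens):
        """Longest cached prefix available for this request (0 if none)."""
        return max((shared_prefix_len(tokens, c) for c in self.cached), default=0)


def route_kv_aware(workers, tokens):
    """Pick the worker with the longest cached prefix (ties go to the first)."""
    return max(workers, key=lambda w: w.best_hit(tokens))


def prefill_tokens_needed(worker, tokens):
    """Tokens that still have to be prefilled before the first output token."""
    return len(tokens) - worker.best_hit(tokens)


a, b = Worker("A"), Worker("B")
b.cached.append([1, 2, 3, 4, 5])   # B already served this system prompt
request = [1, 2, 3, 4, 5, 9, 9]    # same prompt plus two new tokens

chosen = route_kv_aware([a, b], request)
print(chosen.name, prefill_tokens_needed(chosen, request))  # B 2
print(a.name, prefill_tokens_needed(a, request))            # A 7 (what round-robin would hit first)
```

In this toy case the cache-aware choice prefills 2 tokens instead of 7. With long shared system prompts or multi-turn conversations the gap grows accordingly.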
These are the same optimization levers I’ve been tracking across NVIDIA’s inference stack. MLPerf now validates them at audited production scale.
At the hardware level, Blackwell’s SM100 was designed with hooks for this software to exploit: a new tensor memory tier managed explicitly by software, collaborative multi-SM matrix multiply, and TMA multicast for eliminating redundant cache traffic. NVIDIA controls both sides of the optimization loop. That’s why 2.7x isn’t a one-time event — it’s a pattern.
The Competitive Moat
Jensen said it at GTC: “Inference is thinking, and thinking is hard.”
Custom ASICs can be highly efficient on targeted workloads. What they have not demonstrated publicly is this kind of full-stack optimization velocity across changing models and scenarios.
NVIDIA was the only platform to submit on all newly added models and scenarios in v6.0. Their cumulative record: 291 MLPerf wins since 2018, 9x all other submitters combined. The best DeepSeek-R1 result came from Nebius — a partner optimizing on NVIDIA’s open-source stack.
The Dog That Didn’t Bark
Google did not submit its TPU v7 Ironwood to MLPerf Inference v6.0 — the second consecutive round they’ve sat out. BEP Research reported the absence first on X on February 16. Tae Kim at Key Context confirmed it with three independent sources.
Ironwood is a capable chip: 4,614 FP8 TFLOPS, 192GB HBM3e, 7.4 TB/s bandwidth. Google markets it as built for “the age of inference.” But they publish internal benchmarks while declining to submit to the one test where results are peer-reviewed and directly comparable to NVIDIA.
The absence doesn’t prove Ironwood can’t compete. But it removes the most advanced custom silicon competitor from the only standardized, auditable comparison. For investors evaluating the bear case that hyperscaler ASICs will erode NVIDIA’s inference share, that’s a data point worth weighing.
Below the paywall: the three investment conclusions MLPerf v6.0 changes right now, why every AI software company is a token reseller facing a margin event, which parts of the stack benefit most as inference scales, and what the market still isn’t pricing.




