Here’s the dirty secret of GPU-based inference: most of the silicon sits idle, waiting for data.
We’ve spent a decade optimizing neural networks for compute. Bigger models, more parameters, longer context windows. But for autoregressive inference—the token-by-token generation that powers every chatbot, coding assistant, and AI agent—the bottleneck isn’t floating-point operations. It’s memory bandwidth. The arithmetic units finish their work and then wait, cycle after cycle, for the next chunk of weights to arrive from memory.
This isn’t a temporary engineering problem. It’s a fundamental constraint that’s reshaping how we design both chips and models. And it explains why two seemingly unrelated innovations—Groq’s SRAM-based inference chips and AI21’s hybrid SSM-Transformer architecture—are converging toward the same solution.
At CES 2026, Jensen Huang made this explicit: “Moore’s Law can’t keep up with 10x and exponential use of AI. We need to embrace extreme co-design.” The Vera Rubin platform—with its BlueField-4 Context Memory system adding 16TB of distributed KV cache per rack—is NVIDIA’s system-level answer to the memory wall. But to understand why that matters, you need to understand the problem it’s solving.
To understand why memory bandwidth dominates inference, you need to understand how LLMs actually generate text.
Prefill is the first phase: processing the entire input prompt in parallel. Every token in your prompt gets computed simultaneously. This is compute-bound work—matrix multiplications that fully utilize GPU cores. Modern GPUs excel here.
Decode is the second phase: generating output tokens one at a time. Each new token requires loading the model’s weights from memory, performing a relatively small computation, then writing the result back. This happens sequentially—you can’t generate token 47 until you’ve generated tokens 1-46.
The difference in arithmetic intensity is stark. During prefill, you might perform hundreds of operations per byte loaded from memory. During decode, that ratio collapses. The GPU’s compute units finish their work almost instantly, then stall while the memory system catches up.
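A rough way to see the gap is to count operations and bytes for a single weight matrix. The sketch below is a simplified model, not a profiler measurement: the function name, the 8192-wide layer, and the 2-byte values are illustrative assumptions, and it ignores attention, caches, and kernel overheads.

```python
def matmul_arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                                bytes_per_value: int = 2) -> float:
    """Operations per byte for one (n_tokens x d_in) @ (d_in x d_out) matmul.

    Assumes weights and activations each cross the memory bus exactly once.
    """
    flops = 2 * n_tokens * d_in * d_out                # one multiply + one add per term
    weight_bytes = d_in * d_out * bytes_per_value      # loaded regardless of token count
    activation_bytes = n_tokens * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# Hypothetical layer shape; 2048 prompt tokens vs. a single decode token.
prefill_intensity = matmul_arithmetic_intensity(n_tokens=2048, d_in=8192, d_out=8192)
decode_intensity = matmul_arithmetic_intensity(n_tokens=1, d_in=8192, d_out=8192)
# Batching 64 independent sequences during decode raises the token count per weight load.
batched_decode_intensity = matmul_arithmetic_intensity(n_tokens=64, d_in=8192, d_out=8192)

print(f"prefill: {prefill_intensity:.0f} ops/byte")
print(f"decode (1 sequence): {decode_intensity:.2f} ops/byte")
print(f"decode (64 sequences): {batched_decode_intensity:.0f} ops/byte")
```

The weight bytes are the same in every case; only the number of tokens sharing each weight load changes. That is why batching helps decode, and why a single interactive stream sits so far down the curve.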
For most production inference workloads—especially interactive applications where users wait for responses—decode dominates total latency. And decode is fundamentally memory-bound.
NVIDIA’s latest inference optimizations validate this framework directly. Their January 2026 TensorRT-LLM release introduced disaggregated serving—physically separating prefill and decode into different processing pipelines. Why? Because these workloads have fundamentally different memory/compute profiles. Treating them identically leaves performance on the table.
Performance engineers use a framework called the “roofline model” to visualize this constraint. On one axis: arithmetic intensity (operations per byte of memory traffic). On the other: achievable throughput.
Every processor has two ceilings:
- Compute ceiling: the maximum FLOPS the chip can deliver
- Memory ceiling: the maximum bytes per second the memory system can deliver, multiplied by arithmetic intensity
Your actual performance hits whichever ceiling is lower.
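The two ceilings reduce to a one-line formula. Here is a minimal sketch of it; the chip numbers are round, hypothetical values chosen for easy arithmetic, not the specs of any real GPU.

```python
def roofline_attainable(peak_ops_per_s: float, bandwidth_bytes_per_s: float,
                        intensity_ops_per_byte: float) -> float:
    """Attainable throughput is the lower of the compute and memory ceilings."""
    memory_ceiling = bandwidth_bytes_per_s * intensity_ops_per_byte
    return min(peak_ops_per_s, memory_ceiling)


def ridge_point(peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity at which the two ceilings meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical chip: 1,000 TFLOPS of compute, 4 TB/s of memory bandwidth.
PEAK = 1000e12
BANDWIDTH = 4e12

ridge = ridge_point(PEAK, BANDWIDTH)                 # 250 ops/byte
attained = roofline_attainable(PEAK, BANDWIDTH, 75)  # a workload at 75 ops/byte
utilization = attained / PEAK

print(f"ridge point: {ridge:.0f} ops/byte")
print(f"utilization at 75 ops/byte: {utilization:.0%}")
```

Any workload whose intensity sits below the ridge point is memory-bound: adding compute does nothing for it, while adding bandwidth raises throughput proportionally.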
For LLM decode workloads, arithmetic intensity typically falls in the range of 50-100 operations per byte—well below modern GPU compute-to-bandwidth ratios of 150-300+ ops/byte. This means decode often runs at only 30-50% of theoretical compute capacity. The rest is wasted—silicon sitting idle, burning power, waiting for data.
Move to H100 or Blackwell and the ratios shift, but the fundamental dynamic remains. For autoregressive decode, memory bandwidth, not compute, sets the performance ceiling.
This creates what semiconductor analysts at Semi Doped recently called a “design spectrum” for inference hardware: SRAM-only → SRAM+DDR → SRAM+HBM. Each point on that spectrum represents a different tradeoff between bandwidth, capacity, and cost. Understanding where different workloads fall on this spectrum is key to understanding the market structure emerging around inference.
The memory wall thesis isn’t theoretical. NVIDIA just proved it empirically.
In January 2026, NVIDIA released updated TensorRT-LLM optimizations for Blackwell. Running DeepSeek-R1 on GB200 NVL72, the new software delivered 2.8x higher token throughput—on the exact same hardware shipped in October 2025.
The gains came from innovations that directly address the memory wall:
- Disaggregated serving: Separating prefill and decode phases into optimized pipelines
- Enhanced all-to-all communication: Reducing inter-GPU data movement overhead
- NVFP4 precision: New 4-bit floating point format that cuts memory bandwidth requirements
- Multi-Token Prediction (MTP): Predicting multiple output tokens per forward pass
On HGX B200, the results were even more dramatic. Enabling MTP and NVFP4 delivered 4x+ throughput improvement over the FP8 baseline in aggregated serving scenarios.
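To see why precision and MTP attack the same bottleneck, consider an idealized bound in which every forward pass must stream all the weights once. The sketch below is that bound and nothing more. The model size, bandwidth, and the average of two tokens per pass are assumptions for illustration, and it ignores KV cache traffic, mixture-of-experts sparsity, and format overheads, so it will not reproduce NVIDIA's measured numbers.

```python
def decode_tokens_per_second_bound(n_params: float, bits_per_weight: int,
                                   bandwidth_bytes_per_s: float,
                                   tokens_per_pass: float = 1.0) -> float:
    """Upper bound on single-stream decode speed when weight streaming dominates."""
    weight_bytes = n_params * bits_per_weight / 8
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * tokens_per_pass


# Hypothetical dense 70B-parameter model on a chip with 8 TB/s of bandwidth.
N_PARAMS = 70e9
BANDWIDTH = 8e12

fp8_bound = decode_tokens_per_second_bound(N_PARAMS, 8, BANDWIDTH)
# 4-bit weights halve the bytes per pass; an assumed 2 tokens per pass doubles the yield.
fp4_mtp_bound = decode_tokens_per_second_bound(N_PARAMS, 4, BANDWIDTH, tokens_per_pass=2.0)

print(f"8-bit, 1 token/pass: {fp8_bound:.0f} tokens/s")
print(f"4-bit, 2 tokens/pass: {fp4_mtp_bound:.0f} tokens/s")
print(f"ratio: {fp4_mtp_bound / fp8_bound:.1f}x")
```

Neither change adds a single FLOP of capability. Both reduce the bytes moved per generated token, which is the only lever that matters when the workload is memory-bound.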
This is co-design in action. The software stack was optimized specifically for the hardware’s memory hierarchy. Competitors can match chip specs on a datasheet. Matching continuous software optimization across the entire inference stack—quarter after quarter—is far harder.
Groq’s Language Processing Unit (LPU) attacks this problem by eliminating the memory hierarchy entirely.
Instead of HBM (High Bandwidth Memory) sitting off-chip, each Groq chip contains 230MB of on-chip SRAM. SRAM is physically closer to the compute units and operates at dramatically higher bandwidth—up to 80 TB/s internally, compared to H100’s 3.35 TB/s HBM3 bandwidth. That’s roughly 24x more bandwidth per chip.
The tradeoff is capacity. 230MB per chip means large models don’t fit on a single chip—or even a single rack. Running Llama 3 70B requires hundreds of LPU chips distributed across multiple racks. The model weights are distributed across chips, with a high-speed interconnect fabric handling communication.
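The chip count follows from simple division. This is a back-of-the-envelope sketch that counts weights only; the bytes-per-parameter values are assumptions, and real deployments also need room for activations, KV cache, and any replication, so actual counts may differ.

```python
import math


def chips_needed(n_params: int, bytes_per_param: int, sram_bytes_per_chip: int) -> int:
    """Minimum chips required just to hold the weights in on-chip SRAM."""
    total_weight_bytes = n_params * bytes_per_param
    return math.ceil(total_weight_bytes / sram_bytes_per_chip)


SRAM_PER_CHIP = 230_000_000     # 230MB, the per-chip figure quoted above
PARAMS_70B = 70_000_000_000

chips_8bit = chips_needed(PARAMS_70B, 1, SRAM_PER_CHIP)    # assuming 1 byte per weight
chips_16bit = chips_needed(PARAMS_70B, 2, SRAM_PER_CHIP)   # assuming 2 bytes per weight

# Bandwidth ratio from the figures quoted above: 80 TB/s SRAM vs. 3.35 TB/s HBM3.
bandwidth_ratio = 80 / 3.35

print(chips_8bit, chips_16bit, round(bandwidth_ratio, 1))
```

Either way the answer lands in the hundreds of chips, which is why the interconnect fabric is as much a part of the design as the SRAM itself.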
But for inference, this tradeoff pays off in latency and determinism:
- Speeds of 500-1,000+ tokens per second on 70B models, with peaks above 1,500 T/s using speculative decoding (vs. 60-100 T/s for H100 in typical deployments)
- Deterministic latency: no variable queuing delays, no batching jitter, predictable response times
- No memory bandwidth bottleneck: compute units stay fed
This is co-design in its purest form. The silicon constraint (SRAM-only, no HBM) forced a radically different system architecture (distributed model weights across hundreds of chips). You can’t evaluate the chip in isolation—it only makes sense as part of a complete deployment system.
The architecture also explains NVIDIA’s interest. On December 24, 2025, NVIDIA announced a non-exclusive licensing deal with Groq, reportedly valued at approximately $20 billion, hiring CEO Jonathan Ross and senior leadership while licensing LPU technology. GroqCloud continues operating independently, but the deterministic inference concepts are now flowing into NVIDIA’s roadmap.
As I wrote in “Twas the Night Before Groq”: NVIDIA didn’t buy Groq’s architecture—they bought a boundary condition. Groq answers a specific question: what does inference look like when the graph is static, scheduling is solved offline, and memory stalls are eliminated entirely? The answer is valuable even if you never ship the architecture directly.
But here’s the critical nuance that’s emerged since the deal: GPUs and HBM are not dead. As the Semi Doped podcast put it bluntly: “LPUs solve a different problem—deterministic, ultra-low-latency inference for small models. Large frontier models still require HBM-based systems.”
The inference market is fragmenting by workload type, not converging on a single architecture.
While Groq attacked the memory wall with novel silicon, AI21 Labs attacked it with novel model architecture.
The problem with standard Transformers is the KV cache—the key-value pairs stored for each token in the context window. During decode, every new token must attend to all previous tokens, requiring the full KV cache to be loaded from memory. For long contexts, this becomes enormous: a 70B Transformer with 256K context can require over 100GB of KV cache alone—more than the total memory on most GPUs.
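The size of the KV cache follows directly from the model's shape. Below is a minimal sketch of the standard calculation: one key and one value vector per layer, per KV head, per token. The layer count, head count, and head dimension are illustrative assumptions for a 70B-class model, not the configuration of any specific one; the total moves a lot with the number of KV heads and the storage precision.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every token in the context."""
    keys_and_values = 2
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical shape: 80 layers, 8 KV heads of dimension 128, 16-bit values.
CONTEXT_256K = 256 * 1024
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=CONTEXT_256K)
cache_half_context = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                                    seq_len=CONTEXT_256K // 2)

print(f"{cache / 2**30:.0f} GiB at 256K tokens")
print(f"{cache_half_context / 2**30:.0f} GiB at 128K tokens")
```

Every term in the product is a multiplier, so the cache grows linearly with context length and with each concurrent sequence, and all of it has to be read again for every generated token.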
Jamba combines Transformer attention layers with Mamba state-space model (SSM) layers in a 1:7 ratio. SSM layers don’t use attention—they maintain a fixed-size hidden state that summarizes all previous tokens. This means:
- Up to 32x reduction in KV cache: roughly 4GB instead of 128GB for 256K context, according to AI21’s benchmarks
- Linear scaling with sequence length in the SSM layers, where attention in standard Transformers scales quadratically
- Feasible long-context inference on memory-constrained hardware
The 1:7 ratio isn’t arbitrary. AI21’s research found that some attention is still necessary for certain reasoning tasks, but most layers can be SSM without quality degradation. The architecture is explicitly designed around memory constraints—not as an afterthought, but as a first-order design principle.
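The effect of the layer mix on cache size can be sketched by counting only the attention layers, since SSM layers hold a fixed-size state that does not grow with context. The 32-layer depth and head shape below are hypothetical, and the sketch models the layer ratio alone; it does not attempt to reproduce AI21's 32x figure, which depends on the baseline being compared against.

```python
def kv_cache_bytes(n_attention_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache bytes; only attention layers contribute."""
    return 2 * n_attention_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_layers_in_hybrid(total_layers: int, attention: int = 1, ssm: int = 7) -> int:
    """Number of attention layers when layers repeat in an attention:ssm pattern."""
    return total_layers * attention // (attention + ssm)


TOTAL_LAYERS = 32                 # hypothetical depth
SEQ_LEN = 256 * 1024

hybrid_attention_layers = attention_layers_in_hybrid(TOTAL_LAYERS)   # 4 of 32
pure_transformer_cache = kv_cache_bytes(TOTAL_LAYERS, 8, 128, SEQ_LEN)
hybrid_cache = kv_cache_bytes(hybrid_attention_layers, 8, 128, SEQ_LEN)
reduction_from_layer_mix = pure_transformer_cache / hybrid_cache

print(hybrid_attention_layers, reduction_from_layer_mix)
```

With a 1:7 pattern, one layer in eight carries a KV cache, so the layer mix by itself accounts for an 8x reduction under these assumptions.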
Even NVIDIA’s most advanced systems can’t escape the memory wall through brute force alone.
Vera Rubin NVL72 packs 20.7TB of HBM4 per rack—an enormous amount of memory. But for trillion-parameter mixture-of-experts models with long context windows, even that isn’t enough. The KV cache alone can exceed what HBM can hold.
NVIDIA’s answer isn’t just more HBM. It’s a tiered memory hierarchy:
- HBM4: 288GB per GPU at 22TB/s; fast but expensive and capacity-limited
- Network (NVLink 6 + ConnectX-9): 3.6TB/s; enables distributed KV cache across GPUs
- Flash/NVMe SSD: Petabytes of capacity, lower latency than network, 10x cheaper than HBM
The BlueField-4 DPU sits at the center of this architecture, managing KV cache lifecycle—deciding what stays in HBM, what gets pushed to NVMe, and what gets shared across the rack. Flash latency is lower than network latency, so local SSD becomes a viable tier in the memory hierarchy.
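The general idea of spilling KV cache blocks between tiers can be illustrated with a toy two-tier cache. To be clear about the assumptions: the class name, the block granularity, and the least-recently-used policy are all mine. Nothing here describes how BlueField-4 actually makes its decisions, and the network tier is left out entirely.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy cache: a small fast tier that spills least-recently-used blocks to a slow tier."""

    def __init__(self, fast_capacity_blocks: int):
        self.fast_capacity = fast_capacity_blocks
        self.fast = OrderedDict()   # stands in for HBM; ordered oldest to newest
        self.slow = {}              # stands in for NVMe flash

    def put(self, block_id, data):
        self.slow.pop(block_id, None)       # a block lives in exactly one tier
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)     # mark as most recently used
        self._spill_if_full()

    def get(self, block_id):
        """Return (data, tier_it_was_found_in); slow-tier hits are promoted."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id], "fast"
        if block_id in self.slow:
            data = self.slow.pop(block_id)
            self.put(block_id, data)
            return data, "slow"
        raise KeyError(block_id)

    def _spill_if_full(self):
        while len(self.fast) > self.fast_capacity:
            oldest_id, oldest_data = self.fast.popitem(last=False)
            self.slow[oldest_id] = oldest_data


cache = TieredKVCache(fast_capacity_blocks=2)
for block_id in ("a", "b", "c"):
    cache.put(block_id, block_id.upper())   # "a" spills once "c" arrives

data, found_in = cache.get("a")             # served from the slow tier, then promoted
print(data, found_in, list(cache.fast), list(cache.slow))
```

Even in a toy like this, the policy question is the hard part: which blocks to keep close, and when to pay the cost of fetching one back.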
This is system-level co-design. The DPU, the NVMe controllers, the network fabric, and the GPU memory system all have to work together. You can’t buy this capability off the shelf—it requires coordinated design across silicon, firmware, and software.
Groq’s SRAM architecture and Jamba’s SSM-hybrid design aren’t just compatible—they’re complementary responses to the same fundamental constraint.
Groq provides massive bandwidth but limited capacity. Jamba provides dramatic memory reduction at the model level. Together, they enable workloads that neither pure Transformer + HBM architectures nor pure SRAM + standard models can handle efficiently:
- Long-context inference (256K+ tokens) without memory explosion
- Deterministic latency for real-time applications
- Higher throughput per watt by eliminating memory stalls
This is what I mean by co-design. Not just “hardware and software working together”—that’s table stakes. Real co-design means the boundaries dissolve. The silicon constraints shape the model architecture. The model architecture informs the deployment topology. You can’t separate the layers because they were designed as a unified system.
It’s like asking whether a bird’s wing is hardware or software. The wing’s physical structure and the flight pattern are one thing—you can’t optimize them independently.
The Semi Doped podcast surfaced practical use cases where deterministic, sub-millisecond inference creates real value—not just benchmarking bragging rights:
- Ad copy personalization at search latency budgets: When you have 50ms to personalize content before a page renders, every millisecond of inference latency eats into your decision window
- Model routing and agent orchestration: When an AI system needs to decide which specialist model handles a query, the router itself can’t add meaningful latency
- Real-time translation and conversational interfaces: The difference between 200ms and 50ms response time changes whether AI feels like a tool or a participant
- Robotics and physical AI at the edge: Control loops that run at 100Hz can’t wait for cloud inference round-trips
- AI-RAN and telecom infrastructure: 5G network optimization happens at timescales where traditional inference is too slow
These aren’t hypothetical. They’re the workloads driving investment in deterministic inference architectures—and they represent a distinct market segment from training frontier models or serving them at scale.
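The budgets above translate directly into how many tokens a system can emit before the deadline. A minimal sketch of that arithmetic follows; the two decode rates are picked from within the ranges quoted earlier in this piece, and the model ignores prompt processing, network hops, and time to first token unless you pass them in as overhead.

```python
def tokens_within_budget(budget_ms: float, tokens_per_second: float,
                         fixed_overhead_ms: float = 0.0) -> int:
    """Whole tokens that can be generated inside a latency budget."""
    usable_ms = max(budget_ms - fixed_overhead_ms, 0.0)
    return int(usable_ms * tokens_per_second / 1000)


def control_period_ms(rate_hz: float) -> float:
    """Time available per iteration of a fixed-rate control loop."""
    return 1000.0 / rate_hz


FAST_DECODE = 1000   # tokens/s, assumed, within the 500-1,000+ range quoted above
SLOW_DECODE = 80     # tokens/s, assumed, within the 60-100 range quoted above

page_render_fast = tokens_within_budget(50, FAST_DECODE)
page_render_slow = tokens_within_budget(50, SLOW_DECODE)
loop_budget = control_period_ms(100)                       # 100Hz loop
control_fast = tokens_within_budget(loop_budget, FAST_DECODE)
control_slow = tokens_within_budget(loop_budget, SLOW_DECODE)

print(page_render_fast, page_render_slow, control_fast, control_slow)
```

At a 50ms budget the gap is the difference between a short sentence and a handful of tokens. At a 10ms control period, the slower rate does not finish a single token.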
For AI infrastructure investors and builders, the memory wall isn’t going away. HBM4 and CXL will help at the margins, but the fundamental physics—off-chip memory is slow, on-chip memory is limited—creates structural pressure toward integrated solutions.
Watch for:
- Silicon-model co-design: architectures explicitly designed for specific memory hierarchies
- SSM and hybrid architectures: Jamba is early, but linear-scaling models will proliferate
- Deterministic inference for specific workloads: not replacing GPUs, but serving latency-critical applications that GPUs can’t address efficiently
- System-level memory solutions: like NVIDIA’s BlueField-4 Context Memory Platform, which adds distributed KV cache at the rack level rather than trying to solve everything on a single chip
- Continuous software optimization: NVIDIA’s 2.8x throughput gain on existing hardware shows that stack depth compounds over time
The era of treating hardware and software as independent optimization problems is ending. The winners will be those who design them together.
Next in this series: Part 2 examines how NVIDIA is responding—not by building better GPUs, but by acquiring entire layers of the inference stack. The Israel strategy, the Groq deal structure, and what “stack depth” means for competitive dynamics. Coming tomorrow.
For NVIDIA’s full 2028 roadmap—including Feynman’s 3D-stacked SRAM and 1TB HBM4E per GPU—see The Memory Wars.
About the Author
Ben Pouladian is a Los Angeles-based tech investor and entrepreneur focused on AI infrastructure, semiconductors, and the power systems enabling the next generation of compute. He was co-founder of Deco Lighting (2005–2019), where he helped build one of the leading commercial LED lighting manufacturers in North America. Ben holds an electrical engineering degree from UC San Diego, where he worked in Professor Fainman’s ultrafast nanoscale optics lab on silicon photonics and micro-ring resonators, and interned at Cymer, the company that manufactures the EUV light sources for ASML’s lithography systems.
He currently serves as Chairman of the Leadership Board at Terasaki Institute for Biomedical Innovation and is a YPO member. His investment research focuses on AI datacenter infrastructure, GPU computing, and the semiconductor supply chain. Long-term NVIDIA investor since 2016.
Follow on Twitter/X: @benitoz | More at benpouladian.com
Disclosure: The author holds positions in NVIDIA and related semiconductor investments. This is not investment advice.