On Tuesday, Google published a blog post about TurboQuant, a compression technique that shrinks AI inference memory usage by 6x or more with near-zero accuracy loss. The market's response played out over the week: SanDisk fell 11% on Thursday alone and is down 18% over the past five trading days. Micron dropped roughly 7% on Thursday and has fallen nearly 20% over the same stretch. SK Hynix and Samsung were hit in Asia too. The logic was familiar: if you can compress inference memory by 6x, you need less memory. Less memory means less DRAM. Less DRAM means sell Micron.
That logic is wrong. And NVIDIA just shipped the software that proves it.
What TurboQuant Actually Is
When an AI model generates a response, it needs to remember what it has already processed. That memory is called KV cache. It’s the scratchpad the model uses to keep track of prior context so it doesn’t have to reprocess everything from scratch with each new word. The longer the conversation or document, the bigger the scratchpad gets.
TurboQuant compresses that scratchpad. Instead of storing each piece of context at the standard 16 bits per value, it uses roughly 3.5. The result: the same GPU can hold more context in memory, serve more users simultaneously, or process longer documents.
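If you want that in numbers, here is a back-of-envelope sizing sketch. The model dimensions are hypothetical (a 70B-class configuration I picked for illustration), and note that the pure bit-width ratio of 16 to 3.5 works out to about 4.6x; headline figures like 6x depend on what baseline and metadata overheads are counted.

```python
# Back-of-envelope KV cache sizing. Illustrative numbers only --
# this is not TurboQuant's actual algorithm, just the memory arithmetic.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to store keys and values across all layers."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # K and V
    return n_values * bits_per_value / 8

# Hypothetical 70B-class model configuration (assumed, not measured).
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)

fp16 = kv_cache_bytes(**cfg, bits_per_value=16)
q35 = kv_cache_bytes(**cfg, bits_per_value=3.5)

print(f"FP16 cache:     {fp16 / 2**30:.1f} GiB")
print(f"~3.5-bit cache: {q35 / 2**30:.1f} GiB ({fp16 / q35:.1f}x smaller)")
```

At a 128K-token context, the scratchpad alone runs to tens of gigabytes at FP16 on these assumptions, which is why shrinking it translates directly into more concurrent users per GPU.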
The technique is not new. The research paper dates to April 2025. It’s listed as an ICLR 2026 poster. What happened Tuesday was Google resurfacing it through an official blog post, and the market treating it as a new development.
It gets worse. The lead author of RaBitQ, a competing method that TurboQuant benchmarks against, has publicly accused Google of promoting flawed experimental comparisons. Gao Jianyang, a postdoc at ETH Zurich, published a detailed rebuttal stating that these issues were flagged to the TurboQuant team before the conference submission, and that the team acknowledged the problems but chose not to correct them. The paper was accepted anyway, then amplified through Google’s official channels to tens of millions of social media views. The market erased billions of dollars of memory-sector market cap over a paper whose own benchmark comparisons are now being contested by the researchers it cites.
Here’s the critical distinction most coverage missed:
What KV cache compression affects: The working scratchpad the GPU uses during inference. How many users it can serve at once. How long a context window it can support.
What it does not affect: The model itself — its weights, its training data, its stored knowledge. None of that changes. The vast majority of AI memory and storage demand is untouched.
NVIDIA has been optimizing along this axis for years. They already deploy compressed KV cache in production (FP8) and are pushing toward NVFP4, which cuts the scratchpad footprint by roughly another 50%. This direction of travel was never in question.
The question is: what happens with the savings? And that’s where the market gets the story wrong.
Why Efficiency Doesn’t Kill Demand
I’ve written about this pattern repeatedly, starting with The Memory Wars and most recently in The Token Explosion. The dynamic is called the Jevons paradox, and it works like this: when you make a resource cheaper to use, people don’t use less of it. They use far more.
Think of it like highway lanes. When a city adds lanes to reduce congestion, traffic initially improves. But the cheaper commute attracts more drivers, and within a few years the highway is just as packed — with more total cars than before. The efficiency gain didn’t reduce driving. It made driving more attractive.
KV cache compression works the same way. When you shrink the per-query memory footprint, operators don’t pocket the savings and buy fewer GPUs. They reinvest into longer conversations, more simultaneous users, and new workloads that weren’t economically viable before.
The timing of the TurboQuant selloff makes this painfully clear. The same week the market sold memory stocks on a compression paper, two things happened at Anthropic:
First, Anthropic announced they’re rationing Claude session limits during peak hours because demand is exceeding capacity. They’ve shipped efficiency improvements — and they’re still compute-constrained. Efficiency gains absorbed. Demand still growing. Infrastructure still not enough.
Then, hours later, details leaked about Capybara, a next-generation model that reportedly surpasses Claude Opus 4.6 on software coding, academic reasoning, and cybersecurity benchmarks. The catch? Anthropic described it as “a large, compute-intensive model” that is “very expensive for us to serve, and will be very expensive for our customers to use.” They’re working to make it more efficient before they can even release it. Frontier models are getting more compute-hungry, not less.
Meanwhile, Gemini and Claude both offer 1M+ token context windows as shipping products. Models are ingesting entire codebases, long tool traces, user histories, and persistent memory states. Lower KV cache cost per token makes it economical to keep more context active rather than discarding it. From a memory standpoint, that is not bearish. It’s the precursor to sustained demand growth.
Why TurboQuant May Not Scale
There’s a second problem with the demand-destruction narrative: TurboQuant’s results come primarily from smaller, open-source models. There is limited evidence that this kind of aggressive compression works cleanly on the largest frontier models in real production environments.
Here’s the intuition. AI inference has two phases. In the first phase (prefill), the model reads your entire prompt at once — that step can tolerate extra processing overhead. In the second phase (decode), the model generates its response one word at a time, and every extra millisecond of delay compounds. Compression adds overhead to the decode phase — decompressing values, managing lookups — and at frontier model sizes with hundreds of billions of parameters, those costs can erode the gains.
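A toy cost model makes the asymmetry concrete. Every number below is an assumption I made up for illustration, not a measurement from any real system, but the structure is the point: decompression cost scales with context length and is paid once per generated token.

```python
# Toy cost model of the decode phase. All latency figures here are
# illustrative assumptions, not benchmarks of any real model.

def decode_time_ms(output_tokens, context_tokens,
                   base_ms_per_token=20.0, dequant_ms_per_1k_ctx=0.1):
    """Each generated token re-reads the whole cached context, so any
    per-value decompression overhead is paid once per output token."""
    per_token = base_ms_per_token + (context_tokens / 1000) * dequant_ms_per_1k_ctx
    return output_tokens * per_token

ctx, out = 100_000, 1_000  # 100K-token context, 1K-token response
plain = decode_time_ms(out, ctx, dequant_ms_per_1k_ctx=0.0)
quant = decode_time_ms(out, ctx)  # with decompression overhead

print(f"decode, uncompressed: {plain / 1000:.0f} s")
print(f"decode, compressed:   {quant / 1000:.0f} s (+{(quant / plain - 1) * 100:.0f}%)")
```

Under these made-up constants, a modest per-token overhead turns into a 50% slowdown at a 100K-token context. Prefill doesn't suffer the same way because it processes the whole prompt in one batched pass, so the overhead amortizes.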
DeepSeek V4 is a live example. Its release slipped from February to April, with reports that one key factor is instability from aggressive KV cache compression during training. What works on a 7-billion-parameter test model doesn’t always survive at 600 billion parameters with reinforcement learning and millions of concurrent users.
There’s also a signal in where Google published this. Google keeps tight control over which research it shares externally. Technologies already deployed in Gemini typically stay internal. That TurboQuant was published openly suggests it may not be production-ready at the scale that matters most.
Why Dynamo Is the Real Story
While the market was focused on a compression paper, NVIDIA quietly shipped Dynamo 1.0 — and almost nobody noticed.
Dynamo is NVIDIA’s operating system for AI inference at scale. Think of it as the air traffic control system for an AI data center: it decides which GPU handles which request, where to store context so it can be reused, and how to move data between different types of memory depending on what’s needed and how fast it’s needed.
As I wrote in The Token Explosion: “Without Dynamo, Rubin and LPX are two separate chips. With it, they’re one machine.”
Here’s what matters about Dynamo 1.0, and why it tells you which direction memory demand is heading:
Smart routing. When a user sends a follow-up message, Dynamo routes it to the GPU that already has their prior context cached. No need to rebuild the scratchpad from scratch. In Azure’s benchmarks, this cut response times by 20x compared to random assignment. The more cache you keep alive, the better this works.
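In principle, cache-aware routing is a small idea with a big payoff. Here's a minimal sketch; the `Router` class and its fields are my own illustration, not Dynamo's actual API.

```python
# Minimal sketch of cache-aware routing (illustrative; not Dynamo's real
# interface): send follow-ups to the worker that already holds the session's
# KV cache, and fall back to least-loaded for new sessions.

class Router:
    def __init__(self, workers):
        self.workers = list(workers)           # worker ids
        self.cache_location = {}               # session_id -> worker id
        self.load = {w: 0 for w in workers}    # naive request counter

    def route(self, session_id):
        if session_id in self.cache_location:
            # Cache hit: reuse the worker that already has this context.
            w = self.cache_location[session_id]
        else:
            # Cache miss: pick the least-loaded worker and remember it.
            w = min(self.workers, key=self.load.__getitem__)
            self.cache_location[session_id] = w
        self.load[w] += 1
        return w

r = Router(["gpu0", "gpu1"])
first = r.route("sess-42")
again = r.route("sess-42")
assert first == again  # follow-ups stick to the cached worker
```

The key design point: the more cached sessions you keep alive, the more often the fast path fires, which is exactly why routing and cache retention reinforce each other.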
Cache pinning for AI agents. Today’s AI agents run multi-step tasks — writing code, running tests, reading documentation — in sessions that last hours. Traditional systems treat all cached context the same and will discard it when space gets tight. Dynamo lets operators pin high-value context so it resists deletion. NVIDIA is building software to keep cached data alive longer, not to discard it faster.
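One way to picture pinning is an LRU eviction policy that skips protected entries. This is an assumption about how "pin high-value context" might look, not NVIDIA's actual interface.

```python
# Sketch of LRU cache eviction with pinning. The interface is my own
# illustration of the concept, not NVIDIA's actual API.

from collections import OrderedDict

class KVCacheStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # session_id -> size, in LRU order
        self.pinned = set()

    def pin(self, sid):
        self.pinned.add(sid)

    def put(self, sid, size):
        self.entries[sid] = size
        self.entries.move_to_end(sid)  # mark as most recently used
        # Evict least-recently-used *unpinned* entries until we fit.
        while sum(self.entries.values()) > self.capacity:
            victim = next((s for s in self.entries if s not in self.pinned), None)
            if victim is None:
                break  # everything left is pinned; overflow to a lower tier
            del self.entries[victim]

store = KVCacheStore(capacity=10)
store.put("agent-session", 6)
store.pin("agent-session")       # hours-long agent task: protect its context
store.put("chat-1", 3)
store.put("chat-2", 4)           # over capacity: chat-1 is evicted, not the agent
assert "agent-session" in store.entries
```

The `break` branch is the interesting part: when everything resident is pinned, the system can't just drop context, so it needs somewhere to overflow to. That's where the tiering below comes in.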
Four-tier memory management. KV cache is growing too large to fit on a single GPU. Dynamo manages a hierarchy: Tier 1 is fast GPU memory (HBM4, 288 GB per chip at 22 TB/s on the upcoming Rubin platform). Tier 2 is ultra-fast on-chip memory (SRAM on the Groq LPU, 500 MB at 150 TB/s). Tier 3 is rack-scale storage on NVIDIA’s new BlueField-4 hardware — a product category that didn’t exist a year ago, purpose-built for KV cache overflow. Tier 4 is cloud storage for cold context that might be needed later. NVIDIA built four tiers because they expect the scratchpad to outgrow any single one.
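The placement logic across those tiers can be sketched as a waterfall: put each cache block in the fastest tier with room, and spill downward otherwise. The tier names follow the article; the capacities and the policy itself are illustrative assumptions, not Dynamo's implementation.

```python
# Sketch of tiered KV cache placement. Tier names follow the article;
# capacities and the placement policy are illustrative assumptions.

TIERS = [  # ordered fastest to slowest
    ("SRAM", 0.5),                # GiB; tiny but ~150 TB/s (Tier 2 in the article)
    ("HBM", 288),                 # on-GPU memory            (Tier 1)
    ("BlueField", 4096),          # rack-scale storage       (Tier 3)
    ("cloud", float("inf")),      # cold context             (Tier 4)
]

def place(cache_gib, used):
    """Put a cache block in the fastest tier with room; spill downward."""
    for name, cap in TIERS:
        if used.get(name, 0) + cache_gib <= cap:
            used[name] = used.get(name, 0) + cache_gib
            return name
    raise RuntimeError("unreachable: cloud tier is unbounded")

used = {}
assert place(0.2, used) == "SRAM"        # hot, tiny block fits on-chip
assert place(100, used) == "HBM"         # too big for SRAM, fits on-GPU
assert place(250, used) == "BlueField"   # HBM would overflow (100 + 250 > 288)
```

The point of the sketch: once KV caches outgrow HBM, demand doesn't disappear, it cascades into the next tier of memory and storage hardware.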
Agent-aware scheduling. Dynamo accepts metadata about each request — how latency-sensitive it is, how long the response will be, whether the context should be preserved. Running with NVIDIA’s NeMo Agent Toolkit, this delivered 4x faster response times and 1.5x higher throughput. The system is purpose-built for the agentic workloads that generate the most sustained memory demand.
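To make the idea concrete, here's what metadata-aware scheduling could look like as a priority queue. The request fields and the priority formula are my own assumptions for illustration; they are not the NeMo Agent Toolkit's actual schema.

```python
# Sketch of metadata-aware request scheduling. The fields and priority
# formula are illustrative assumptions, not any real toolkit's schema.

from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    priority: float                     # lower value = served sooner
    name: str = field(compare=False)

def enqueue(queue, name, latency_sensitive, expected_output_tokens):
    # Latency-sensitive requests jump the queue; long generations are
    # deferred so short interactive turns aren't stuck behind them.
    priority = (0 if latency_sensitive else 1) + expected_output_tokens / 10_000
    heapq.heappush(queue, Request(priority, name))

q = []
enqueue(q, "batch-summarize", latency_sensitive=False, expected_output_tokens=4000)
enqueue(q, "chat-turn", latency_sensitive=True, expected_output_tokens=200)
assert heapq.heappop(q).name == "chat-turn"  # interactive turn served first
```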
Every feature NVIDIA built into Dynamo 1.0 is designed to manage, preserve, route, and extend KV cache. This is NVIDIA telling you, through production software now deployed at CoreWeave, ByteDance, Together AI, and AstraZeneca, that the future involves more cached state, not less.
What the Market Keeps Missing
I predicted exactly this reaction before GTC in my Pre-GTC Thesis Update: “If Jensen showcases any dedicated inference engine that bypasses HBM, expect a knee-jerk reaction in memory suppliers. Micron drops 3-4% on the headline… That logic is wrong.”
The playbook keeps repeating. An efficiency improvement gets announced. The market extrapolates linearly: if each query uses less memory, total demand falls. Stocks sell off. It happened with DeepSeek in early 2025. It’s happening again now.
But the underlying economics haven’t changed. The cost per token goes down. The number of tokens per task goes up — because smarter models think longer, agents run continuously, and context windows keep expanding. Net infrastructure demand increases.
The question was never how many bytes per token get saved. It’s what the industry does with those savings. Dynamo 1.0 answers that with production code: run more context, serve more users, keep agent memory alive, build workloads that weren’t possible before.
That’s the curve that matters.
If You Found This Useful
This post is free because the TurboQuant selloff hit a lot of portfolios this week, and I wanted the analysis out quickly for anyone who needs it.
But this is one piece of a much larger framework. If you’re an investor or allocator trying to understand where AI infrastructure demand is actually headed, here’s what paid subscribers have been reading:
The Memory Wars — the full architecture thesis: why NVIDIA’s 2028 roadmap ends the AI chip competition, the HBM4 race, 3D-stacked SRAM, and the $20B Groq deal.
The Token Explosion — six layers of the AI factory mapped from GTC 2026, with exclusive interviews from three NVIDIA executives. Dynamo, LPX, the five pricing tiers, and the $300B-per-gigawatt math.
The Reasoning Tax — why GPT-5.4’s benchmarks prove smarter models don’t save infrastructure, they devour it.
Paid subscribers also get real-time analysis when events like this move markets, and coverage connecting silicon architecture to investment positioning that you won’t find anywhere else.
If this is the kind of analysis that helps you think, consider subscribing. The volatility is with us. The thesis will be ready.
Disclosure: I hold positions in NVDA, LITE, CRDO, ALAB, LSCC, TSEM, ORCL (2027 LEAPS), and BE. This is not investment advice — do your own research.
Resources
- The Memory Wars: Why NVIDIA’s 2028 Architecture Ends the AI Chip Competition
- The Token Explosion: Why GTC 2026 Maps the AI Infrastructure Future
About the Author
Ben Pouladian is a Los Angeles-based tech investor and entrepreneur focused on AI infrastructure, semiconductors, and the power systems enabling the next generation of compute. He was co-founder of Deco Lighting (2005–2019), where he helped build one of the leading commercial LED lighting manufacturers in North America. Ben holds an electrical engineering degree from UC San Diego, where he worked in Professor Fainman’s ultrafast nanoscale optics lab on silicon photonics and micro-ring resonators, and interned at Cymer, the company that manufactures the EUV light sources for ASML’s lithography systems.
He currently serves as Chairman of the Leadership Board at Terasaki Institute for Biomedical Innovation and is a YPO member. His investment research focuses on AI datacenter infrastructure, GPU computing, and the semiconductor supply chain. Long-term NVIDIA investor since 2016.
Follow on Twitter/X: @benitoz | More at benpouladian.com

