Breaking: The Wall Street Journal reported tonight that NVIDIA plans to launch a new processor designed to help customers build faster, more efficient AI systems. That’s the headline. But the WSJ piece doesn’t tell you what kind of processor, why it matters, or how it connects to a trail of technical disclosures NVIDIA has been dropping for months. For that, you need to follow the breadcrumbs.
Let me explain how I got here.
Last week in A Pre-NVIDIA GTC Thesis Update, I laid out a three-piece convergence thesis for what NVIDIA is building toward on the Feynman roadmap: stacked memory, backside power delivery, and a new compiler paradigm. All three aimed at breaking through the memory wall that constrains AI inference economics. I listed “Groq integration signals” as a top GTC watch item and teased “a fourth piece I haven’t discussed yet — optical die-to-die interconnect from ISSCC.”
This is that piece.
The catalyst was semiconductor analyst Irrational Analysis, who published an unfinished draft called “It’s the Dataflow, Stupid” — the most technically rigorous breakdown of the Groq deal I’ve read anywhere. This is someone who spent years publicly attacking Groq’s architecture. He went through Groq’s patents, read their Hot Chips presentations, analyzed their compiler model, and fully reversed his position. His core argument, that the real value isn’t SRAM but a compiler-driven dataflow architecture, fundamentally changed how I think about what NVIDIA bought for $20 billion. I skipped dinner tonight to get this piece out. I owe him credit for cracking this open. He is a legend. What follows is my attempt to translate his technical argument for investors, connect it to the ISSCC and optical breadcrumbs I’ve been tracking, and map it onto the co-design thesis we’ve been building at BEP Research.
So I traced the breadcrumbs. The $20 billion Groq licensing deal in December. The optical clock-forwarded die-to-die link presented at ISSCC in February. The Spectrum-X Photonics and Quantum-X Photonics switches unveiled at CES. Dennis Abts — Groq’s former chief architect and a co-inventor of the dragonfly network topology — quietly joining NVIDIA three years ago. Jensen himself comparing Groq to Mellanox on the Q4 earnings call this week.
And then today, Jonathan Ross — Groq’s founder, inventor of Google’s TPU, now inside NVIDIA — tweeted: “If it feels like things are moving fast, brace yourself. This is the slowest things will ever move ever again.” The same day the WSJ story drops. The same week Jensen opened NVIDIA’s earnings call by emphasizing that 2026 is the Year of the Horse in the Chinese zodiac — the symbol of speed. That’s not a coincidence.
Individually, these are interesting data points. Together, they outline something specific: a dedicated dataflow inference engine that uses compiler-driven scheduling, optical clock synchronization, and on-chip SRAM — bypassing CoWoS advanced packaging and HBM entirely.
And now the Wall Street Journal is confirming the direction.
I want to translate the technical argument for BEP Research readers, map it onto the thesis we’ve been building, and walk through the investment implications ahead of Jensen’s GTC keynote on March 16.
Start with the technical disclosure most people missed.
At ISSCC this month, NVIDIA presented an optical clock-forwarded die-to-die link. This is not a standard SerDes paper. Clock-forwarding means one die transmits its clock signal alongside the data, and the receiving die locks onto that forwarded clock to achieve cycle-accurate synchronization between separate pieces of silicon.
The engineering is non-trivial. The system uses injection-locked oscillators — circuits that lock their phase and frequency to an injected reference tone, filtering out jitter in the process. NVIDIA built redundancy into every lane, with all lanes capable of receiving the forwarded clock. That’s an area-expensive design choice you make for yield optimization at production scale, not for a research demo.
Having spent time in Professor Fainman’s ultrafast optics lab at UC San Diego working on silicon photonics, I can tell you that injection locking at these speeds is real engineering. The bandpass filter design for the forwarded clock is sensitive to process, voltage, and temperature variation. Getting this right across production volumes — with tunable delay elements on both clock and data lanes — represents years of circuit design iteration.
When I covered NVIDIA’s optical strategy in The AI Datacenter Optical Interconnect Boom and the CES coverage of Spectrum-X Photonics, I framed these as scale-out networking advances. The ISSCC paper changes the frame. This is scale-up technology — die-to-die, not rack-to-rack.
Why does NVIDIA need cycle-accurate synchronization between dies? GPUs don’t require it. NVLink doesn’t require it. The current Blackwell architecture doesn’t require it.
A dataflow architecture requires it.
Why this matters in plain English: Optical clock forwarding lets NVIDIA connect multiple separate chips so they behave as one coherent processor. Same clock, same timing, every cycle. For a GPU, this is nice but not essential — hardware schedulers handle timing mismatches dynamically. For a dataflow architecture where the compiler pre-schedules everything assuming perfect synchronization across all chips, it’s existential. Without it, the system breaks the moment one chip drifts by a few cycles. With it, you can scale a dataflow engine from one die to an entire rack, and the compiler still treats the whole thing as a single unified machine. That’s the unlock. That’s what Groq alone could never solve. This is Mellanox-level integration for the compute fabric — at rack scale.
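To make the failure mode concrete, here is a deliberately toy sketch in Python. Every number in it is invented for illustration (the scheduled cycles, the link latency); it is not how Groq or NVIDIA actually model this, just the handoff problem in miniature: the compiler bakes in the exact cycle a value crosses the die-to-die link, and a few cycles of clock skew is enough to make the receiving die read that link at the wrong moment.

```python
# Toy model of why cycle-accurate synchronization is existential for a
# statically scheduled machine. All numbers are illustrative, not from
# NVIDIA or Groq documentation.

def run_handoff(skew_cycles: int) -> str:
    """Die A is scheduled to drive a value onto the link at cycle 100; the
    compiler assumed 4 cycles of flight time and scheduled Die B to read the
    link at cycle 104. Both numbers were fixed before power-on."""
    produce_cycle = 100     # Die A: compiler-scheduled write to the link
    link_latency = 4        # flight time the compiler assumed, in cycles
    consume_cycle = 104     # Die B: compiler-scheduled read of the link

    arrives = produce_cycle + link_latency   # when the value is actually there
    reads = consume_cycle + skew_cycles      # when Die B's drifted clock reads

    if reads != arrives:
        return f"BROKEN: Die B read the link {reads - arrives:+d} cycles off schedule"
    return "OK: handoff landed on the exact cycle the compiler planned"

# Forwarded clock: both dies count cycles from the same reference, so skew is 0.
print(run_handoff(skew_cycles=0))   # OK
# Independent oscillators: a few cycles of drift and the schedule falls apart.
print(run_handoff(skew_cycles=3))   # BROKEN
```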
The popular framing of Groq is “the SRAM chip” — a processor with lots of fast on-chip memory instead of external HBM. That framing is incomplete in a way that matters for valuing the deal.
As Irrational Analysis argues, the real differentiator is dataflow architecture. Specifically: Groq pushes all scheduling decisions — every memory access, every data movement, every operation — into the compiler before the chip runs. The hardware makes zero dynamic decisions at runtime. It executes a pre-computed schedule with cycle-accurate precision.
If Groq is “the SRAM chip,” then NVIDIA bought a niche memory architecture. Interesting but limited. That’s roughly where I framed it in December when I wrote in ‘Twas the Night Before Groq that NVIDIA bought a boundary condition.
If Groq is a compiler-driven dataflow architecture, then NVIDIA bought something more fundamental: the scheduling intelligence to orchestrate any heterogeneous memory hierarchy. That capability slots directly into the three-piece convergence — and the ISSCC clock-forwarding paper is the physical interconnect that makes it work across multiple dies.
A conventional AI chip — GPU, TPU — makes thousands of real-time decisions while running. Which data to fetch next. Where to store intermediate results. How to handle a memory miss. Hardware manages this dynamically, which provides flexibility but introduces unpredictability. Operations stall waiting for data. Utilization drops below theoretical peaks.
Groq eliminates all of that. The compiler schedules every operation before the chip turns on. The chip itself is radically simple — no caches, no dynamic memory management, no branch prediction. Just execution units following a pre-computed script.
Technically, this is a 144-wide VLIW (very long instruction word) architecture — 144 parallel operations scheduled per cycle. For context, Google’s TPU uses 8-wide VLIW for its control unit. Groq went 18 times wider. Groq’s own chief architect confirmed this characterization at Hot Chips 2022.
The on-chip SRAM is not a cache (hardware-managed) but a scratchpad (software-managed). The compiler decides exactly what data lives where and when. No hardware guessing. No cache misses. If the compiler gets it right, the chip achieves near-theoretical peak performance.
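A minimal sketch of what “the compiler does everything” means in practice. Everything here is invented for illustration: the slot layout, the scratchpad addressing, and the 144-unit width, used only because the article cites that number. The point is the division of labor: a “compiler” fixes the cycle, the execution unit, and the scratchpad address of every operation ahead of time, and the “chip” does nothing but replay that script.

```python
# Toy illustration of compile-time scheduling versus runtime scheduling.
# Names, widths, and addresses are made up for clarity; this is not Groq's compiler.

from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    cycle: int         # the exact clock cycle this operation executes
    unit: int          # which parallel execution unit runs it (VLIW-style)
    op: str            # e.g. "load", "mac", "store"
    scratch_addr: int  # compiler-chosen scratchpad address (no cache, no guessing)

def compile_matvec(rows: int, units: int) -> list[Slot]:
    """'Compiler': statically assign every operation a cycle, a unit,
    and a scratchpad address before the chip ever runs."""
    schedule = []
    for r in range(rows):
        cycle = r // units            # pack `units` operations into each cycle
        unit = r % units
        schedule.append(Slot(cycle, unit, "mac", scratch_addr=r * 64))
    return sorted(schedule, key=lambda s: (s.cycle, s.unit))

def execute(schedule: list[Slot]) -> None:
    """'Chip': no caches, no stalls, no dynamic decisions. It just replays
    the pre-computed script, cycle by cycle."""
    for slot in schedule:
        pass  # in real hardware: drive that unit's control signals this cycle

program = compile_matvec(rows=288, units=144)   # 144-wide, per the VLIW framing above
execute(program)
print(f"{len(program)} operations pre-scheduled across "
      f"{program[-1].cycle + 1} cycles, zero runtime decisions")
```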
The tradeoff: the most demanding compiler ever built. Irrational Analysis estimates the hardware could be replicated by a small team in six months. The compiler, refined over six-plus years of iteration running a money-losing inference cloud, is where the real IP lives. That compiler team now works at NVIDIA.
Groq’s architecture has one critical vulnerability that explains why it remained a startup burning cash instead of displacing GPUs: chip-to-chip synchronization.
The compiler pre-schedules everything assuming perfect timing across all chips in the system. Every chip. Every server. Every rack. If one chip drifts by even a few clock cycles, execution stalls — by design. The compiler cannot recover from timing skew it didn’t anticipate.
Groq’s existing synchronization relies on counter-based schemes that, as Irrational Analysis documents by reading their patents, are fundamentally fragile. SerDes jitter accumulates. Parts-per-million (ppm) frequency mismatch between the independent oscillators on separate chips is unavoidable. Temperature gradients across a rack change propagation delays over time. These are physics problems, not engineering oversights.
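Rough arithmetic, using illustrative figures rather than anything from Groq’s or NVIDIA’s specs, shows why a few cycles of drift is not an occasional glitch but a constant condition for independently clocked chips:

```python
# Back-of-envelope: how quickly ppm-level frequency mismatch turns into
# whole clock cycles of skew. Frequencies and ppm values are illustrative.

clock_hz = 1.0e9        # assume a 1 GHz core clock
mismatch_ppm = 50       # two "identical" oscillators, 50 ppm apart

skew_cycles_per_second = clock_hz * mismatch_ppm * 1e-6
print(f"{skew_cycles_per_second:,.0f} cycles of skew per second")          # 50,000

# A statically scheduled machine that cannot tolerate even a few cycles of
# skew drifts out of spec in well under a millisecond unless something
# continuously re-aligns the clocks.
seconds_to_3_cycles = 3 / skew_cycles_per_second
print(f"{seconds_to_3_cycles * 1e6:.0f} microseconds to accumulate 3 cycles")  # 60
```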
Now look at what NVIDIA presented at ISSCC: optical clock-forwarded die-to-die links with injection locking.
Clock forwarding solves the synchronization problem at the physical layer. Instead of each chip running an independent clock and trying to compensate for drift after the fact, a forwarded clock ensures all dies operate from the same timing reference. Injection locking cleans up the remaining jitter. The tunable delay elements on both clock and data lanes handle the residual process and temperature variation.
This is the breadcrumb that connects the ISSCC paper to the Groq deal. NVIDIA didn’t present optical clock-forwarding because Blackwell needs it. They presented it because a dataflow inference engine needs it — and they’re building one.
Now extend the optical thread further. At CES, NVIDIA unveiled Spectrum-X Photonics with TSMC COUPE silicon photonics — 102.4 Tb/s switching bandwidth with co-packaged optics. I covered this in the CES deep dive. The Rubin Ultra generation (2027) is expected to incorporate CPO into scale-up networking — GPU-to-GPU via NVSwitch.
A dataflow inference engine connected by optically clock-forwarded links, within a photonic fabric, achieves something no electrical interconnect can: deterministic synchronization at rack scale without the jitter and drift penalties that made Groq’s standalone system fragile. The optical layer doesn’t just improve bandwidth. It enables the architecture.
Irrational Analysis identifies five layers of NVIDIA IP that solve Groq’s known weaknesses. This is where the co-design thesis operates at full scale — no standalone company has all five.
Layer 1: Groq’s compiler DNA. Six years of building a whole-graph static scheduler for the most constrained programming model in computing. The compiler team, not the chip design, is the irreplaceable asset. That team now sits inside NVIDIA.
Layer 2: Clock-forwarded optical interconnect. The ISSCC paper. Cycle-accurate synchronization between dies using injection-locked oscillators — the physical-layer solution to Groq’s biggest vulnerability. Extends naturally into NVIDIA’s broader photonic roadmap (Spectrum-X Photonics, TSMC COUPE, Rubin Ultra CPO).
Layer 3: Hybrid bonding for 3D SRAM stacking. We tracked this across The Memory Wars and Raja Was Right. NVIDIA’s packaging roadmap enables stacking SRAM at densities Groq alone couldn’t achieve — the same principle AMD proved with 3D V-Cache, applied at datacenter scale to build large scratchpad memory.
Layer 4: NVLink fabric. Groq’s existing chip-to-chip communication is bandwidth-constrained. Inside NVIDIA’s NVLink domain, a dataflow engine inherits 1.8 TB/s per-GPU interconnect bandwidth with NVLink 6 scaling further on Rubin.
Layer 5: Thermal infrastructure. Stacking SRAM on top of compute dies generates concentrated heat. NVIDIA’s liquid cooling systems — developed through multiple generations of high-TDP datacenter GPUs — are directly applicable. This is an underappreciated enabler.
In last week’s update, I listed six watch items for Jensen’s March 16 keynote. The breadcrumb trail — and now the WSJ report — adds specificity to what I’m looking for.
Any mention of a dedicated inference architecture separate from the GPU training stack. Not a smaller GPU. Not an inference-optimized SKU. A fundamentally different compute architecture for inference workloads — one that uses compiler-driven static scheduling rather than dynamic hardware scheduling.
References to deterministic execution or static scheduling in the context of new hardware. This is the linguistic marker that Groq’s architectural DNA has entered the product roadmap.
Optical die-to-die in the context of inference. If Jensen connects the ISSCC clock-forwarding paper to an inference product — rather than keeping it in the networking discussion — that confirms the breadcrumb trail leads where I think it leads.
Inference products that bypass CoWoS and HBM. A dataflow engine built on standard packaging with on-chip SRAM would use entirely different manufacturing resources than training GPUs. This is the supply chain signal that makes the financial model additive.
If NVIDIA reveals a Groq-derived inference product direction at GTC — or signals one on the Feynman roadmap — the implications cascade across the AI infrastructure landscape.
For NVIDIA’s financial model: A dedicated inference product that doesn’t compete for CoWoS or HBM supply creates incremental revenue without cannibalizing training GPU production. Training GPUs remain supply-constrained through at least 2027. A parallel inference line using standard packaging, on-chip SRAM, and mature process nodes unlocks TAM the street isn’t modeling. As I wrote last week, NVIDIA at ~14x CY2027 earnings is priced for deceleration. A second product line growing simultaneously shifts that calculus.
For custom ASIC and TPU competitors: The competitive thesis for Google TPU, Amazon Trainium, Cerebras, Etched, and inference-focused startups has been: GPUs are overkill for inference, purpose-built silicon wins. That argument weakens considerably if NVIDIA ships its own purpose-built inference architecture with full platform integration — CUDA ecosystem, TensorRT-LLM, NVLink, Dynamo. Google’s absence from MLPerf Inference v6.0 already signals pressure. A dedicated NVIDIA inference engine accelerates it.
For the supply chain: Bullish for SRAM capacity demand, hybrid bonding equipment (Besi, EV Group), and standard packaging providers. The optical interconnect ecosystem — Lumentum, Coherent, Astera Labs, Credo — gains another demand driver if optically synchronized inference engines reach production. HBM demand for training GPUs continues growing independently, but a parallel inference product that uses zero HBM trims the incremental HBM demand narrative at the margin.
For model architecture companies: SSM-Transformer hybrids like AI21’s Jamba — covered in The Memory Wall — become more valuable. These architectures are purpose-designed for memory-constrained inference with linear-scaling KV cache that fits within SRAM capacity. Silicon-model co-design isn’t theoretical. It’s the business model.
In December, I framed the Groq deal as defensive — NVIDIA buying a boundary condition to study, then reintroducing “memory, CUDA, fault tolerance, and generality where needed.” In the GTC preview, I upgraded the framing: “Groq could be as instrumental to NVIDIA as Mellanox.” Jensen used the same comparison on this week’s earnings call.
Following the breadcrumbs from ISSCC through the optical roadmap, I’m landing closer to the Mellanox comparison. The reported $20 billion — nearly 3x Groq’s September valuation — the four-day talent integration, the ISSCC clock-forwarding paper, the optical die-to-die trajectory, and now the WSJ report all point toward a product bet, not a research investment.
The three-piece convergence from last week — stacked memory, backside power, the compiler — has a fourth piece arriving faster than expected: optically-synchronized dataflow inference. Hardware so simple the compiler does everything. A compiler so sophisticated it took six years and a cloud business to train. Optical interconnect so precise it solves the synchronization problem that kept the architecture from scaling.
That’s co-design. That’s what $20 billion buys. And if any of it shows up at GTC on March 16, the competitive landscape for inference changes permanently.
I’ll be at the SAP Center watching Jensen’s keynote live. Subscribers will hear about it in real time. Register for GTC here — virtual access is free.
NVIDIA isn’t just improving GPUs. The breadcrumbs — Groq’s compiler team, ISSCC’s optical clock forwarding, CES photonics, and now the WSJ report — point toward a second product line: a dedicated dataflow inference engine that uses no HBM, no CoWoS, and no dynamic scheduling. The compiler does everything before the chip wakes up. Optical interconnect keeps the dies synchronized. If this surfaces at GTC, it means NVIDIA’s inference TAM is additive to training GPU revenue, not a cannibalization story. Every custom ASIC startup just lost its differentiation narrative.
About the Author
Ben Pouladian is a Los Angeles-based tech investor and entrepreneur focused on AI infrastructure, semiconductors, and the power systems enabling the next generation of compute. He was co-founder of Deco Lighting (2005-2019), where he helped build one of the leading commercial LED lighting manufacturers in North America. Ben holds an electrical engineering degree from UC San Diego, where he worked in Professor Fainman’s ultrafast nanoscale optics lab on silicon photonics and micro-ring resonators, and interned at Cymer, the company that manufactures the EUV light sources for ASML’s lithography systems.
He currently serves as Chairman of the Leadership Board at Terasaki Institute for Biomedical Innovation and is a YPO member. His investment research focuses on AI datacenter infrastructure, GPU computing, and the semiconductor supply chain. Long-term NVIDIA investor since 2016.
Follow on Twitter/X: @benitoz | More at benpouladian.com
Disclosure: The author holds positions in NVIDIA and related semiconductor investments. This is not investment advice.