The arithmetic units inside a modern GPU spend much of LLM inference waiting.
An H100 has enormous tensor throughput, on the order of 2 to 4 PFLOP/s depending on precision and whether you count sparsity. But during autoregressive decode, the bottleneck is usually not multiplication. In low-to-moderate batch decode, each new token effectively requires the system to stream through the active model weights and attend over accumulated state, while the GPU’s HBM bandwidth (about 3.35 TB/s on an H100 SXM) is finite. Unless a workload performs hundreds of useful operations per byte fetched from memory, the tensor cores can’t stay fully occupied.¹
¹ The exact ratio depends on precision and accounting conventions, but the important point is stable: modern GPUs can perform hundreds of floating-point operations for every byte they can pull from HBM. Autoregressive decode is far below that threshold.
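To make the ratio concrete, here is a minimal roofline sketch using the two H100 figures quoted above. The 2 FLOPs-per-parameter rule and the choice to ignore KV-cache reads are simplifying assumptions, and the function names are illustrative only.

```python
# Roofline sketch: compare what decode needs with what the GPU offers.
HBM_BANDWIDTH = 3.35e12   # bytes/s, H100 SXM figure quoted in the text
PEAK_FLOPS = 2.0e15       # FLOP/s, low end of the text's 2-4 PFLOP/s range


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs per byte needed before compute, not memory, is the limit."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """Rough FLOPs per weight byte in one decode step.

    Assumes ~2 FLOPs (multiply + add) per parameter per sequence and that
    each weight is read from HBM once per step; KV-cache reads are ignored.
    """
    return 2.0 * batch_size / bytes_per_param


def compute_utilization(batch_size: int, bytes_per_param: float) -> float:
    """Fraction of peak FLOP/s the roofline allows at this batch size."""
    intensity = decode_intensity(batch_size, bytes_per_param)
    return min(1.0, intensity / ridge_point(PEAK_FLOPS, HBM_BANDWIDTH))


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    for batch in (1, 32, 512):
        util = compute_utilization(batch, bytes_per_param=1.0)
        print(f"batch {batch}: utilization ceiling {util:.3f}")
```

Under these assumptions a single-sequence decode at one byte per parameter can use well under 1% of peak compute, and the ceiling only reaches 1.0 at a batch of roughly 300 sequences.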
And this problem has persisted across GPU generations. Peak tensor throughput has grown faster than external memory bandwidth, so for the decode phase the operational-intensity gap hasn’t closed. If anything it has widened.
That is the basic fact underneath most of the interesting AI hardware market right now. The useful question for any company in this space is which part of the memory problem they’re attacking, and whether solving it lets them avoid competing head-on with NVIDIA. The market is starting to organize around where you attack that constraint: inside the chip, inside the serving engine, inside the cache hierarchy, or inside the physical package and rack.
Groq ↗ built a chip with no HBM at all, replacing it with on-chip SRAM (memory built directly into the processor die, much faster than DRAM but also much less dense, so you get far less capacity per unit area). The whole execution model was deterministic: the compiler scheduled every cycle statically, with no caches and no dynamic scheduling. You need many more chips to hold a large model, but you never wait for HBM. NVIDIA’s Groq deal changed the competitive frame: SRAM-style inference is no longer just a startup bet; it is now part of NVIDIA’s own platform roadmap. The result is the Groq 3 LPX inference accelerator, integrated into NVIDIA’s Vera Rubin ↗ platform alongside the main GPU.
Cerebras ↗ built a single chip that spans an entire silicon wafer, over 50x the area of an H100 die, with 44 GB of on-chip SRAM and 21 PB/s of internal memory bandwidth. Same thesis, different mechanism. They went public in May 2026.
MatX ↗ is building around scratchpad memories (software-managed fast memory, as opposed to hardware-managed caches) designed for the access patterns of transformer inference. d-Matrix ↗ is doing in-memory compute, moving the arithmetic into the memory array itself.
Every one of these is a different attempt to shrink or eliminate the gap between where data lives and where arithmetic happens.
Even with imperfect hardware, software can route around the bottleneck. A crude back-of-the-envelope roofline heuristic falls out of this: the batch size at which decode stops being bandwidth-bound scales like $B \approx 300 \times N_{\text{total}} / N_{\text{active}}$, where $N_{\text{total}}$ and $N_{\text{active}}$ are the total and per-token active parameter counts, and the 300 reflects the approximate ratio of peak compute to memory bandwidth on current hardware. For a dense model that’s about 300 tokens. For a DeepSeek ↗-style MoE where roughly 37B of 700B total parameters are active per token, the approximate batch is closer to 6,000. In practice, the true number moves around with precision, KV pressure, parallelism strategy, latency targets, and the serving stack, but the direction is what matters. Below that point, you are bandwidth-limited. Above it, you are increasingly compute-limited. At a rough hardware level, you can think of the system as running in cycles on the order of tens of milliseconds, roughly the time it takes to stream through HBM, with each cycle producing one new token per active sequence.
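The two worked numbers above are consistent with reading the heuristic as 300 times the ratio of total to active parameters. That reading is an inference from the text's own examples, not a quoted formula. A small sketch under that assumption, with an 80 GB HBM capacity assumed for the cycle-time estimate:

```python
def balanced_batch(total_params: float, active_params: float,
                   compute_to_bandwidth: float = 300.0) -> float:
    """Approximate batch size where decode shifts from bandwidth-bound
    to compute-bound, assuming it scales as 300 * total / active."""
    return compute_to_bandwidth * total_params / active_params


def hbm_sweep_ms(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream through all of HBM once, in milliseconds."""
    return 1e3 * capacity_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Dense model: every parameter is active for every token.
    print(f"dense: {balanced_batch(100e9, 100e9):.0f}")
    # MoE figures from the text: ~37B active of ~700B total.
    print(f"MoE:   {balanced_batch(700e9, 37e9):.0f}")
    # Assumed 80 GB of HBM at the 3.35 TB/s quoted earlier.
    print(f"cycle: {hbm_sweep_ms(80e9, 3.35e12):.1f} ms")
```

The MoE case comes out a little under 5,700, which the text rounds to 6,000, and the sweep time lands in the low tens of milliseconds.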
The scheduling problem that falls out of this is how to pack requests into these cycles to maximize throughput while hitting latency targets. Prefill is usually compute-heavy. Decode is usually memory-bandwidth-sensitive. They want different things from the hardware, and mixing them efficiently is hard.
RadixArk ↗ and Inferact ↗ have both recently launched into this. RadixArk is the SGLang ↗ team, building commercial infrastructure on top of SGLang for inference and Miles ↗ for RL training. Inferact is the vLLM ↗ team doing the same with vLLM. Both Berkeley projects, both widely deployed, now racing each other as companies.
Why is this interesting? Because the problem compounds. Every new model architecture, every new hardware platform changes the optimization surface. A team that builds deep expertise in roofline-aware scheduling accumulates an advantage that grows, because the problem keeps getting harder as hardware gets more heterogeneous and workloads get more varied. The less obvious question is whether this market consolidates around one dominant engine, or fragments across internal forks, managed services, and open-source defaults. RadixArk has some differentiation through Miles, which gives them RL training as a second surface area. But the head-to-head on inference will be intense.
The memory problem goes deeper than weight reads. During inference, you also read the KV cache, the stored attention state for every previous token in the context. The weights are fixed. The KV cache grows linearly with context length and batch size.
For a large transformer model, the KV cache can easily be hundreds of KB per token once you account for all layers, KV heads, head dimensions, and precision. For a model with roughly 100B active parameters and a total KV footprint of around 500 KB per token, the KV cache reaches 100 GB at about 200,000 tokens, matching the weight footprint at one byte per parameter. Below that, you're mostly paying for the weight read. Above it, KV cache dominates and costs climb linearly. This is a plausible technical reason long-context pricing often steps up past thresholds like 200K tokens, though pricing also reflects product segmentation and capacity planning.
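The arithmetic behind that crossover is short enough to write down. The sketch below uses the standard per-token KV size (keys plus values, across all layers and KV heads). The example model shape is hypothetical and chosen only to show how "hundreds of KB per token" arises.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache added per token: one key and one value vector
    per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def crossover_tokens(weight_bytes: float, kv_per_token: float) -> float:
    """Total cached tokens at which KV reads match the weight read."""
    return weight_bytes / kv_per_token


if __name__ == "__main__":
    # Hypothetical shape: 80 layers, 8 KV heads, head_dim 128, 16-bit values.
    per_token = kv_bytes_per_token(80, 8, 128, 2)
    print(f"{per_token / 1024:.0f} KiB per token")
    # The text's round numbers: 100 GB of weights, 500 KB of KV per token.
    print(f"crossover at {crossover_tokens(100e9, 500e3):,.0f} tokens")
```

Note that the crossover counts tokens held across the whole batch, so many short conversations can reach it just as one very long context can.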
There's a nice observation from Reiner Pope's lecture ↗ about how the memory hierarchy maps onto this. The "drain time" of each tier (capacity divided by bandwidth) sets its natural retention horizon. HBM drains in about 20 milliseconds. DDR drains in seconds. Flash in about a minute. Spinning disk in about an hour. If you're holding KV cache for an idle conversation, it should migrate down this hierarchy rather than hogging HBM.
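One way to read the drain-time idea is as a placement rule: keep KV state in the fastest tier whose retention horizon still covers the expected idle time. The sketch below is a toy version of that rule. The horizons are the rough figures quoted above, except the 5-second DDR value, which is an assumed stand-in for "seconds", and the policy itself is an illustration rather than anything the lecture prescribes.

```python
def drain_time_s(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read a tier end to end: capacity divided by bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


# (tier name, approximate retention horizon in seconds), fastest first.
TIERS = [
    ("HBM", 0.02),     # ~20 ms
    ("DDR", 5.0),      # "seconds"; 5 s is an assumed placeholder
    ("Flash", 60.0),   # ~1 minute
    ("Disk", 3600.0),  # ~1 hour
]


def pick_tier(expected_idle_s: float) -> str:
    """Fastest tier whose horizon covers the expected idle time.

    Anything idle longer than the slowest horizon stays in the last tier.
    """
    for name, horizon in TIERS:
        if expected_idle_s <= horizon:
            return name
    return TIERS[-1][0]


if __name__ == "__main__":
    # Assumed 80 GB of HBM at 3.35 TB/s gives roughly the 20 ms figure.
    print(f"HBM drain: {drain_time_s(80e9, 3.35e12) * 1e3:.0f} ms")
    for idle in (0.01, 1.0, 30.0, 600.0):
        print(f"idle {idle:>6} s -> {pick_tier(idle)}")
```

A real system would also weigh the cost of recomputing the KV state against the cost of fetching it, which this sketch ignores.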
TensorMesh ↗ is building around this. Their open-source project LMCache ↗ stores KV caches across GPU, CPU RAM, NVMe, and S3, so reusable KV state gets pulled back in rather than recomputed from scratch. My guess is that this is more likely to be absorbed into serving engines than remain a large standalone platform, but in the meantime LMCache is doing work the engines don't yet seem to want to own directly.
Once you push memory and scheduling hard enough at the chip and server level, the constraint moves outward to the package, the board, and eventually the rack.
CoWoS (TSMC's advanced packaging technology, the thing that physically connects GPU dies to HBM stacks on an interposer) has been a genuine supply bottleneck. And the jump from 72-GPU systems to 500+ GPU scale-up domains is not just a networking problem. It runs into connector density, cable bend radius, power delivery, liquid cooling, and HBM attach yield. The problems are mechanical engineering and materials science, which creates real distance from what the chip companies are good at.
The parallel worth thinking about is ASML, which makes the lithography machines every advanced fab depends on. They enable all chip designers without competing with any of them. A company that cracks a packaging or interconnect bottleneck could be in an ASML-like market position, though likely less monopolistic: the value here may be split across TSMC, substrate vendors, optical interconnect suppliers, and ODMs rather than concentrating in a single system-of-systems the way EUV lithography has.
The market is becoming a stack of memory problems. Chips try to keep weights closer to compute. Serving engines try to batch and schedule around bandwidth. KV cache systems try to move attention state through a hierarchy instead of leaving it stranded in HBM. Packaging and interconnect try to make the rack behave more like one machine.
LMCache ↗, the open-source project TensorMesh is built on, is already being pulled into the major serving engines and runtimes even as the company tries to sell it standalone. That's probably the template. The useful primitive gets integrated, and the standalone wrapper has to find something else to be.
And then there's Etched ↗, which is making a different kind of bet entirely, wagering that the transformer architecture is settled enough to burn into fixed-function silicon. If it's right, every transistor does exactly the right thing. If it's wrong, there's much less room to maneuver than there is in a general-purpose accelerator. And now that NVIDIA's Rubin platform includes its own SRAM-based inference accelerator via the Groq deal, Etched isn't just competing against legacy GPUs anymore.
The major counterweight is algorithmic progress: hardware changes slowly, while serving techniques and model architectures can move quickly. Speculative decoding, KV-cache compression, sparsity, distillation, and native low-bit models all reduce the amount of memory movement required per useful token, and if those techniques scale cleanly, they weaken the case for some exotic hardware approaches. But they do not remove the underlying constraint. They change its shape. Lower precision reduces bytes per parameter, but creates demand for native low-bit arithmetic. Speculative decoding amortizes weight reads, but depends on draft-model acceptance rates and leaves KV-cache pressure intact. Sparsity reduces active parameters, but increases routing and interconnect complexity. The target is moving, which is why durable hardware companies need architectures that remain useful as the bottleneck shifts.
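The speculative decoding trade-off can be sketched with the usual geometric-series estimate: if a draft proposes k tokens and each is accepted independently with probability alpha, one pass over the target model's weights yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The independence assumption and the decision to ignore the draft model's own cost are simplifications.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    alpha: per-token acceptance probability (assumed independent).
    k: number of draft tokens proposed per pass.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus one bonus
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def weight_bytes_per_token(weight_bytes: float, alpha: float, k: int) -> float:
    """Target-model weight traffic amortized over the tokens produced."""
    return weight_bytes / expected_tokens_per_pass(alpha, k)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        gain = expected_tokens_per_pass(alpha, k=4)
        print(f"alpha={alpha}: {gain:.2f} tokens per weight read")
```

The gain is bounded by k + 1 and falls off quickly as acceptance drops, and nothing in it reduces the KV cache that must still be read for every sequence.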
I believe durable companies will be built at each layer. What's less clear is whether those companies remain independent. The most useful primitives in AI infrastructure tend to get absorbed quickly by serving engines, clouds, labs, or NVIDIA itself. The hard part isn't proving that the bottleneck matters, but owning a control point that can't be internalized somewhere else in the stack.