The inference unbundling: why prefill and decode are splitting the GPU

Inference is not one workload. It is two, prefill and decode, and they want different silicon. The market is splitting accordingly. Nvidia acquired Groq in a deal reportedly valued at $20 billion. Cerebras partnered with AWS to pair its systems with Trainium. Both deals pair a compute engine with a memory engine. The single-accelerator era is ending. Disaggregated inference, as Cerebras frames it in "The GPU is Being Split in Half," is beginning. (For a broader look at where these components fit, see our earlier piece on the AI inference stack.)

What replaces the single chip is a routed inference stack: each request flows through a FLOPs-rich accelerator for prefill, hands the key-value (KV) cache to a memory-bandwidth-rich accelerator for decode, and streams the output back. We have been tracking this architectural shift across our AI infrastructure portfolio, and I think it is the most consequential change to inference economics since the move to batched serving.

The two phases

Large language model (LLM) generation is autoregressive: every output token depends on every token that came before it. That dependency creates two computational regimes.

Prefill processes the input prompt. Every prompt token is known up front, so the model computes key and value vectors for all of them in parallel, as large matrix multiplies that build the KV cache. The phase is compute-bound, limited by floating-point throughput (FLOP/s), with arithmetic intensity on the order of 100–400 FLOPs per byte read.

Decode generates output tokens one at a time. Each step streams the full model weights and the entire KV cache from memory to multiply against a single activation vector, then appends one new K/V pair per layer to the cache. The phase is memory-bandwidth-bound, with arithmetic intensity of roughly 1–2 FLOPs per byte at small batch sizes.

The same model on the same hardware presents two entirely different workloads.
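The gap in arithmetic intensity can be reproduced with a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions, not a measurement: it models a single square weight matrix in 16-bit precision, and the function name `matmul_intensity` and the sizes (`d_model=8192`, a 256-token prompt) are hypothetical choices made for the example.

```python
def matmul_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte when n_tokens activations hit one d_model x d_model weight matrix."""
    # One multiply and one add per weight, per token.
    flops = 2 * n_tokens * d_model * d_model
    # The weights are read once, however many tokens share them.
    weight_bytes = d_model * d_model * bytes_per_value
    # Input activations are read and output activations are written.
    activation_bytes = 2 * n_tokens * d_model * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# Assumed sizes for illustration only.
prefill = matmul_intensity(n_tokens=256, d_model=8192)  # many tokens share one weight read
decode = matmul_intensity(n_tokens=1, d_model=8192)     # one token pays for the full weight read

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions the prefill case lands in the hundreds of FLOPs per byte and the decode case near one, because the cost of reading the weights is amortized over every prompt token in prefill and over a single token in decode.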

The metrics that decide deployment

Two latencies decide whether an inference deployment is usable.

Time to first token (TTFT) is set by prefill. Interactive UX breaks above roughly 3 seconds.
Time per output token (TPOT) is set by decode. Interactive targets sit at 100–300 ms per token; at a minimum, generation has to keep pace with reading speed.
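How the two latencies combine into what a user experiences can be sketched as follows. The helper names and the sample values (a 1-second TTFT, a 500-token reply) are assumptions chosen for illustration; only the 100–300 ms TPOT range comes from the text.

```python
def response_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: the first token arrives after TTFT, each later token adds one TPOT."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)


def decode_rate_tokens_per_s(tpot_s: float) -> float:
    """Streaming rate the user sees once generation has started."""
    return 1.0 / tpot_s


# Assumed request: 1 s to first token, 500 output tokens.
fast = response_latency_s(ttft_s=1.0, tpot_s=0.1, output_tokens=500)
slow = response_latency_s(ttft_s=1.0, tpot_s=0.3, output_tokens=500)

print(f"100 ms TPOT: {fast:.1f} s total, {decode_rate_tokens_per_s(0.1):.1f} tokens/s")
print(f"300 ms TPOT: {slow:.1f} s total, {decode_rate_tokens_per_s(0.3):.1f} tokens/s")
```

For any reply longer than a few tokens the TPOT term dominates the total, which is why decode speed shapes perceived responsiveness even when TTFT is acceptable.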

API pricing reflects the asymmetry. Output tokens cost meaningfully more than input tokens at every major provider — GPT-4o lists $2.50 per million input tokens against $10.00 per million output tokens, a 4x premium on the memory-bound phase. Decode is where the margin lives.
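The effect of the 4x premium on a single request is easy to work through. The sketch below uses the GPT-4o list prices quoted above; the token counts (a 2,000-token prompt, and 500 versus 10,000 output tokens) are hypothetical request shapes, and `request_cost_usd` is an illustrative helper rather than any provider's billing API.

```python
INPUT_USD_PER_M = 2.50    # GPT-4o list price per million input tokens, as quoted above
OUTPUT_USD_PER_M = 10.00  # GPT-4o list price per million output tokens, as quoted above


def request_cost_usd(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input cost, output cost) in USD for one request."""
    input_cost = input_tokens * INPUT_USD_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_M / 1_000_000
    return input_cost, output_cost


def output_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of the request's price that comes from output (decode) tokens."""
    input_cost, output_cost = request_cost_usd(input_tokens, output_tokens)
    return output_cost / (input_cost + output_cost)


# Hypothetical request shapes: a chat reply and a long reasoning trace.
chat_share = output_share(input_tokens=2_000, output_tokens=500)
reasoning_share = output_share(input_tokens=2_000, output_tokens=10_000)

print(f"chat: {chat_share:.0%} of price from output tokens")
print(f"reasoning: {reasoning_share:.0%} of price from output tokens")
```

Under these assumed shapes the output side goes from about half of the bill to nearly all of it as replies lengthen, which is the sense in which decode is where the margin lives.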

Why one chip can't do both well

Modern graphics processing units (GPUs) are built around tensor cores fed by high-bandwidth memory (HBM). Prefill keeps both busy. Decode leaves them badly unbalanced: the tensor cores idle while HBM saturates, streaming the full weights to produce a single output token at a time. Plot it on a roofline and the two phases sit on opposite sides of the knee. During decode on an H100 SXM, the GPU sustains fewer than 50 TFLOPS (tera floating-point operations per second) out of a rated peak of 990 TFLOPS, at most about 5% tensor core utilization, because the memory bus, not the compute array, gates every token.
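The roofline argument can be written down in a few lines. This is a minimal sketch, assuming the 990 TFLOPS peak quoted above and an HBM bandwidth of 3.35 TB/s for the H100 SXM; the bandwidth figure is an assumption taken from Nvidia's published specifications and does not appear in this article.

```python
PEAK_TFLOPS = 990.0   # H100 SXM peak, as quoted in the text
HBM_TB_PER_S = 3.35   # assumption: H100 SXM HBM3 bandwidth from Nvidia's spec sheet


def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline: throughput is capped by compute or by bandwidth x intensity, whichever is lower."""
    return min(PEAK_TFLOPS, flops_per_byte * HBM_TB_PER_S)


# Arithmetic intensity at which the memory roof meets the compute roof.
knee = PEAK_TFLOPS / HBM_TB_PER_S

for label, intensity in [("decode", 2.0), ("prefill", 400.0)]:
    tflops = attainable_tflops(intensity)
    print(f"{label}: {tflops:.1f} TFLOPS ({tflops / PEAK_TFLOPS:.1%} of peak)")
print(f"knee at about {knee:.0f} FLOPs/byte")
```

With these inputs, decode at 1–2 FLOPs per byte sits far to the left of the knee and reaches only a few TFLOPS, consistent with the "fewer than 50 TFLOPS" observation, while prefill at the upper end of its range is capped by the compute roof.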

Inference engines have spent the past two years working around this. vLLM, an open-source LLM serving engine, ships a chunked-prefill scheduler that interleaves prefill chunks with decode steps so that compute and memory bandwidth are both kept busy on the same GPU. TNG Technology Consulting reports that the technique increased total token throughput by approximately 50% in its production vLLM deployment serving Llama-3.1-8B. The lesson is structural: even on one chip, the operator gets paid to treat prefill and decode as different jobs.
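The idea behind chunked prefill can be shown with a toy scheduler. The sketch below is not vLLM's implementation or API; `Request`, `schedule_step`, and `token_budget` are hypothetical names, and the policy (serve every decoding request first, then spend the leftover per-step token budget on prefill chunks) is a simplified assumption about how such a scheduler might behave.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens_left: int  # prompt tokens that have not been prefilled yet

    @property
    def decoding(self) -> bool:
        return self.prompt_tokens_left == 0


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Plan one engine step: how many tokens each request processes."""
    plan: dict[str, int] = {}
    # Decode first: every decoding request needs exactly one token slot.
    for req in requests:
        if req.decoding and token_budget > 0:
            plan[req.name] = 1
            token_budget -= 1
    # Spend what is left on prefill, splitting long prompts into chunks.
    for req in requests:
        if not req.decoding and token_budget > 0:
            chunk = min(req.prompt_tokens_left, token_budget)
            plan[req.name] = chunk
            req.prompt_tokens_left -= chunk
            token_budget -= chunk
    return plan


# Two requests are mid-generation; a third arrives with a 1,000-token prompt.
requests = [Request("a", 0), Request("b", 0), Request("c", 1000)]
step1 = schedule_step(requests, token_budget=512)
step2 = schedule_step(requests, token_budget=512)
print(step1)
print(step2)
```

In this toy run the long prompt is absorbed over two steps while the two decoding requests keep emitting a token per step, so the compute-heavy chunk and the bandwidth-heavy decode tokens share each batch instead of taking turns.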

The silicon-level unbundling is the same trick at scale.

Nvidia + Groq: FLOPs buys bandwidth

Nvidia owns prefill. Hopper and Blackwell are HBM-rich and FLOPs-rich, and they run the dominant share of frontier-model context processing. They do not own decode economics. HBM is supply-constrained, with only three vendors (SK Hynix, Samsung, and Micron), and decode burns bandwidth that could otherwise serve training. (We mapped this dynamic in detail in The Memory Triopoly.)

The relevant Nvidia HBM specs:

[Table not recovered from source: Nvidia HBM specs by GPU generation]

Each generation buys more bandwidth, but the underlying architecture is still HBM-fed tensor cores. Decode still loads the full weight matrix from external memory for every token.
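Because every decoded token requires a full pass over the weights, memory bandwidth sets a hard ceiling on single-stream decode speed. The sketch below estimates that ceiling. It is a rough upper bound under stated assumptions: it ignores KV-cache reads and batching, the 3.35 TB/s figure is an assumed H100 SXM bandwidth from Nvidia's published specifications rather than a number from this article, and `max_decode_tokens_per_s` is a hypothetical helper.

```python
def max_decode_tokens_per_s(
    params_billion: float, bytes_per_param: float, bandwidth_tb_per_s: float
) -> float:
    """Upper bound on tokens/s for one sequence: bandwidth divided by bytes read per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / weight_bytes


# Assumed example: an 8B-parameter model in 16-bit weights on 3.35 TB/s of HBM.
bound = max_decode_tokens_per_s(params_billion=8, bytes_per_param=2, bandwidth_tb_per_s=3.35)
print(f"at most about {bound:.0f} tokens/s per sequence")
```

Batching raises aggregate throughput by sharing each weight read across sequences, but the per-sequence ceiling only moves when bandwidth rises or the bytes per parameter fall.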

[Figure: Diagram comparing NVIDIA Rubin GPU and Groq 3 LPU, highlighting Rubin's high FLOPS and large HBM memory versus Groq 3 LPU's high-bandwidth on-chip SRAM for low-latency inference.]

Groq is the architectural opposite. No HBM. The Language Processing Unit (LPU) runs entirely from on-die static random-access memory (SRAM), which delivers at least an order of magnitude more bandwidth than HBM and is not bottlenecked by the HBM3e supply chain. The chip does not win on raw FLOPs. It wins on bandwidth per dollar in the one phase where bandwidth is the constraint: decode.

[Figure: Diagram showing NVIDIA Dynamo coordinating heterogeneous inference across two racks. Large context is processed on VR NVL72 GPUs for prefill and KV cache creation. During decode, GPUs run attention over the KV cache while LPX runs FFN/MoE execution, with interim decode activations exchanged between GPU and LPX in a repeated loop to generate tokens.]

Acquiring Groq lets Nvidia route prefill to Blackwell and decode to LPUs over high-bandwidth fabric. The HBM bus stops being the bottleneck on output-token economics, and Nvidia's HBM exposure on inference workloads shrinks. The strategic prize is freeing HBM supply for training, where it has no substitute.
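Routing the two phases to different silicon means the KV cache has to cross the fabric, so its size matters. The sketch below estimates that size and the hand-off time. It assumes the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with 16-bit values, and the 100 GB/s link speed is a purely hypothetical figure, not a specification of any Nvidia or Groq interconnect.

```python
def kv_cache_bytes(
    tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: int, link_gb_per_s: float) -> float:
    """Time to move the cache over a link with the given bandwidth in GB/s."""
    return num_bytes / (link_gb_per_s * 1e9)


# Assumed model shape (Llama-3.1-8B) and an 8,192-token prompt.
size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
handoff = transfer_seconds(size, link_gb_per_s=100)  # hypothetical link speed

print(f"KV cache: {size / 2**30:.2f} GiB, hand-off: {handoff * 1000:.1f} ms")
```

The hand-off is a one-time cost per request, so it is cheap relative to a long decode but grows linearly with context length, which is why the transfer policy becomes a tuning knob for long prompts.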

Cerebras + AWS: SRAM meets Trainium

Cerebras is the wafer-scale extreme. The Wafer Scale Engine 3 (WSE-3) is a single 46,225 mm² die, roughly 57x the silicon area of an H100, with 44 GB of on-chip SRAM. At roughly 4-bit precision, where 70B parameters occupy about 35 GB, a 70B-class model and a modest KV cache fit on-die. Decode runs at SRAM speed. Cerebras has built its commercial pitch around sustained tokens-per-second on memory-bound inference, and the architecture maps directly onto the decode workload.
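Whether a model fits in 44 GB of SRAM is simple arithmetic on parameter count and precision. The sketch below works it through for a 70B-parameter model; `weight_gb` is an illustrative helper, and the calculation counts weights only, leaving whatever remains as headroom for the KV cache and activations.

```python
SRAM_GB = 44  # WSE-3 on-chip SRAM, as quoted in the text


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight footprint in GB for a given parameter count and precision."""
    return params_billion * bits_per_param / 8


for bits in (16, 8, 4):
    footprint = weight_gb(70, bits)
    headroom = SRAM_GB - footprint
    verdict = "fits" if headroom > 0 else "does not fit"
    print(f"{bits}-bit: {footprint:.0f} GB of weights, {verdict} ({headroom:+.0f} GB headroom)")
```

Only the 4-bit case leaves room on a single wafer under this simple accounting; higher precisions would need the model spread across more than one system.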

What Cerebras lacks is fleet-scale prefill capacity. CS-3 systems are scarce, capital-intensive per unit, and optimized for memory-bound work. Long-context prefill — increasingly the dominant workload as agentic and reasoning models drive context windows toward 1M tokens — needs cheap, abundant FLOPs.

AWS Trainium fills the gap. Trainium2 is a high-FLOPs, HBM-equipped accelerator built for compute-bound work. The integrated Cerebras–AWS stack runs prefill on Trainium fleets, hands KV state to Cerebras CS-3 systems, and serves decode at wafer-scale bandwidth. AWS monetizes Trainium silicon on a workload where Nvidia GPUs already dominate. Cerebras gets the prefill volume the WSE was never designed to deliver.

The Cerebras–AWS pairing is a mirror-image strategy to Nvidia/Groq, built on the same underlying thesis.

The economics

Three forces make the unbundling structural rather than a clever optimization.

Workload mix is shifting toward decode. Reasoning and agentic models output 10–50x more tokens per request than chat-style models did 18 months ago — a single chain-of-thought trace from OpenAI's o1 can generate over 10,000 output tokens where a GPT-4 chat reply rarely exceeded 500. A growing share of frontier-model inference revenue comes from decode rather than prefill.
HBM is the binding constraint on decode capacity. HBM3e production runs near full utilization, with 2026 supply already pre-allocated — a dynamic we traced in our analysis of the structural bottlenecks in the AI data center supply chain. Any decode workload that migrates to SRAM-based silicon frees HBM for training and prefill, where the alternatives are worse.
Decode carries the margin. At a 4x price premium over input tokens, decode is the gross-margin lane. Operators serving decode on lower-cost silicon capture the spread directly. A disaggregated stack lowers cost per output token versus a single-architecture deployment — the exact figure depends on workload mix, but the sign is unambiguous.

What this changes

The HBM oligopoly's pricing power weakens at the margin. Decode migration off HBM-dependent silicon slows HBM3e demand growth and reshapes HBM4 timing assumptions.
Inference orchestration becomes a control point. The router that decides which phase runs on which silicon — and tunes batch size, chunk size, and KV-cache transfer policy — is the new margin layer. The technical primitives already exist inside vLLM (chunked prefill, prefix caching, KV-cache offloading). The step from "interleave on one GPU" to "route across heterogeneous accelerators" is small. Nvidia (Dynamo, post-Groq), AWS (Bedrock + Trainium + Cerebras), and independent inference platforms (Together, Fireworks, Baseten) will fight for it.
Capital intensity per token declines. A unified GPU stack overpays for FLOPs on decode and overpays for HBM on prefill. The disaggregated stack pays for each only where it is consumed.
New silicon entrants get a clearer wedge. The single-architecture incumbency advantage erodes when the workload itself is two distinct workloads. Memory-bound silicon (Groq, Cerebras, d-Matrix, Tenstorrent) gets a defensible decode lane. Compute-bound silicon keeps prefill but loses its decode rent.
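The routing decision described above can be sketched as a small placement function. Everything here is hypothetical: the `Pool` type, the throughput figures, and the latency-only objective are assumptions made for illustration, and a production router would also weigh cost, queue depth, and batch effects, and would not know the output length in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    prefill_tokens_per_s: float  # prompt-processing speed
    decode_tokens_per_s: float   # per-request generation speed


def request_seconds(prompt: int, output: int, prefill_pool: Pool, decode_pool: Pool,
                    kv_transfer_s: float) -> float:
    """Estimated latency; the KV hand-off is paid only when the two phases run on different pools."""
    handoff = 0.0 if prefill_pool is decode_pool else kv_transfer_s
    return (prompt / prefill_pool.prefill_tokens_per_s
            + handoff
            + output / decode_pool.decode_tokens_per_s)


def route(prompt: int, output: int, pools: list[Pool], kv_transfer_s: float) -> tuple[Pool, Pool]:
    """Pick the (prefill pool, decode pool) pair with the lowest estimated latency."""
    candidates = [(p, d) for p in pools for d in pools]
    return min(candidates,
               key=lambda pair: request_seconds(prompt, output, pair[0], pair[1], kv_transfer_s))


# Hypothetical pools: one FLOPs-rich, one bandwidth-rich.
compute = Pool("compute", prefill_tokens_per_s=20_000, decode_tokens_per_s=50)
memory = Pool("memory", prefill_tokens_per_s=2_000, decode_tokens_per_s=500)

long_job = route(prompt=8_000, output=1_000, pools=[compute, memory], kv_transfer_s=0.05)
tiny_job = route(prompt=10, output=1, pools=[compute, memory], kv_transfer_s=5.0)
print([p.name for p in long_job], [p.name for p in tiny_job])
```

With these made-up numbers the long request is split across both pools, while the tiny request stays on one pool because the hand-off would cost more than it saves. That trade-off between transfer overhead and phase-specific speed is the policy surface the routing layer controls.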

Bottom line

Prefill and decode operate in different physical regimes with different cost structures and different margin profiles. Prefill needs raw FLOPs and saturates tensor cores. Decode needs raw bandwidth and saturates memory buses. No single chip is optimized for both, and the pricing asymmetry — output tokens at 4x the cost of input tokens — means the economic incentive to specialize is large and growing.

Inside a single GPU, scheduler-level separation of the two phases through chunked prefill already pays, as TNG's 50% throughput gain demonstrates. Across silicon, Nvidia/Groq and Cerebras/AWS are executing the same structural trade: pair a compute engine with a memory engine, route the workload, capture the spread. Both pairings point to the same conclusion.

The single-accelerator era is over. The routing layer — the software that decides which phase runs on which silicon and manages KV cache transfers between them — is where the margin moves next. Founders building in this space should pay close attention.

Chris Zeoli

Author

April 27, 2026
