1. Overview
Wafer-Scale Integration (WSI) is not an original concept of Cerebras. In 1980, Gene Amdahl, the father of the IBM mainframe, founded Trilogy Systems to manufacture an entire wafer as a single processor. Trilogy raised $230 million from entities including IBM and Sperry Rand, the largest startup financing in Silicon Valley history at the time. During prototype testing, however, the entire wafer short-circuited upon power-up and burned to a dim red glow, the metal wiring layers delaminated, and the thermal solution failed completely. Compounded by a devastating fab flood, the sudden death of the company president, and a car accident that seriously injured Amdahl himself, Trilogy ended in total failure five years after its founding. In the same period, Texas Instruments, ITT, and the U.S. National Security Agency (NSA) all attempted the WSI route, and they reached the same conclusion: a commercial wafer-scale chip would require 99.99% fabrication yield, which at the time was considered unachievable for at least another 100 years.
Cerebras Systems was founded in 2015 by the core SeaMicro team (Andrew Feldman, Gary Lauterbach, Michael James, Sean Lie, Jean-Philippe Fricker). SeaMicro, founded in 2007, was known for its high-density, low-power micro-server designs and was acquired by AMD in 2012 for $334 million. This team had a deep understanding of how to solve system-level bottlenecks with unconventional architectures, a DNA that carried directly into Cerebras’s technical approach to WSI.
Cerebras provided engineering solutions in the following five dimensions that all previous WSI attempts had failed to deliver [6][8]: defect tolerance and yield control, wafer-scale cross-die interconnect (reticle stitching), mechanical compensation for thermal expansion coefficients, a vertical power delivery architecture, and high-flow-rate direct liquid cooling. This transformed WSI, which had been suspended in theoretical discussion since the 1980s, into a mass-producible commercial reality for the first time.
As of 2026, Cerebras has introduced three generations of Wafer-Scale Engine (WSE-1/2/3), building a complete product line from the single-chip CS-3 system to 2,048-node clusters. On May 14, 2026, it completed its IPO on Nasdaq (ticker: CBRS), raising $5.55 billion at $185 per share, with a first-day opening price of $350 and a fully diluted valuation of approximately $48.8 billion [15][17], making it the largest U.S. technology IPO since 2019.
2. Product Evolution: From WSE-1 to WSE-3
2.1 Core Parameter Comparison Across Three Generations of Wafer-Scale Engines
Since launching its first WSE in 2019, Cerebras has driven generational technology leaps on roughly a two-year cycle. The process node evolved from TSMC 16nm to 5nm, transistor count grew from 1.2 trillion to 4 trillion, and core compute power surged from 47 PFLOPS to 125 PFLOPS. Below is a comprehensive physical parameter comparison of the three WSE generations against the NVIDIA H100:
| Specification | WSE-1 (2019) | WSE-2 (2021) | WSE-3 (2024) | NVIDIA H100 (Reference) |
|---|---|---|---|---|
| Process Node | TSMC 16nm | TSMC 7nm | TSMC 5nm | TSMC 4N |
| Wafer/Die Area | 46,225 mm² | 46,225 mm² | 46,225 mm² | 814 mm² |
| Transistor Count | 1.2 Trillion | 2.6 Trillion | 4.0 Trillion | 80 Billion |
| AI-Optimized Cores | 400,000 | 850,000 | 900,000 | 16,896 (CUDA cores) |
| On-Chip Memory (SRAM) | 18 GB | 40 GB | 44 GB | 0.05 GB (L2 Cache) |
| On-Chip Memory Bandwidth | 9 PB/s | 20 PB/s | 21 PB/s | ~0.003 PB/s (HBM3) |
| On-Chip Interconnect Bandwidth | 100 Pb/s | 220 Pb/s | 214 Pb/s | 0.0576 Pb/s (NVLink) |
| FP16 Peak Compute | 47 PFLOPS | 75 PFLOPS | 125 PFLOPS | ~2 PFLOPS |
| System Product | CS-1 | CS-2 | CS-3 | DGX H100 |
2.2 Key Changes in Generational Evolution
WSE-1 (2019): The first-generation commercial wafer-scale chip, integrating 400,000 cores and 1.2 trillion transistors on a 16nm process. 18 GB of on-chip SRAM provided 9 PB/s of bandwidth and 47 PFLOPS of compute. The CS-1 system, a 19-inch rack-mountable device, proved the commercial viability of the wafer-scale approach. Initial customers included life sciences institutions like GlaxoSmithKline and AstraZeneca, as well as U.S. national laboratories.
WSE-2 (2021): The process node jumped to 7nm, transistor count more than doubled to 2.6 trillion, and core count increased to 850,000. 40 GB SRAM with 20 PB/s bandwidth pushed compute to 75 PFLOPS. The WSE-2 entered the Computer History Museum’s collection, named “The Biggest Chip In the World.” The CS-2 system was the first specified to support training models of up to 120 trillion parameters, and it underpinned the Andromeda (16 interconnected units, 1 ExaFLOP) and Condor Galaxy series supercomputers.
WSE-3 (2024): 5nm process, 4 trillion transistors, 900,000 cores, 44 GB SRAM (SRAM growth approached saturation: only a 10% increase from WSE-2 to WSE-3, while transistor count grew 54%), 21 PB/s bandwidth, 125 PFLOPS. The CS-3, housed in a 15U chassis and drawing about 23 kW, is claimed to deliver roughly twice the performance of the CS-2 at the same power envelope [2][4]. It was named one of Time magazine’s Best Inventions of 2024.
2.3 CS-3 System Specifications
| Specification Item | Parameter |
|---|---|
| Processor | WSE-3 (5nm, 4 Trillion Transistors, 900,000 Cores) |
| Peak Compute | 125 PFLOPS (FP16) |
| On-Chip Memory | 44 GB SRAM (21 PB/s Bandwidth) |
| External Memory Expansion | MemoryX (1.5 TB ~ 1.2 PB) |
| Cluster Scalability Limit | 2,048 Nodes (256 ExaFLOPs) |
| Cooling | Proprietary Water Cooling (100 L/min, 20 °C) |
| Power Consumption | ~23 kW |
| Form Factor | 15U Rack |
| Model Capacity | Up to 24 Trillion Parameters |
2.4 Condor Galaxy Supercomputing Network
Cerebras collaborated with Abu Dhabi’s G42 group to deploy the Condor Galaxy (CG) series supercomputers:
| System | Announcement Date | Peak Compute | Total Wafer Cores | Location |
|---|---|---|---|---|
| CG-1 | July 2023 | 4 ExaFLOPs | 54 Million | USA |
| CG-2 | November 2023 | 4 ExaFLOPs | 54 Million | USA |
| CG-3 | March 2024 (Groundbreaking) | 8 ExaFLOPs | 58 Million | Dallas |
| Full Network Aggregate | - | 16 ExaFLOPs | 166 Million | Cross-Region |
2.5 Scientific Computing Benchmark: Molecular Dynamics Simulation
In collaboration with Sandia National Laboratories, Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory (LANL), and the U.S. National Nuclear Security Administration (NNSA), researchers successfully simulated high-precision interactions between 800,000 atoms on the WSE-2 [3][20]. The simulation computed with a time step of 1 femtosecond (10^-15 seconds), and a single step on the WSE-2 took only microseconds. Its speed significantly surpassed Frontier, then the world’s top supercomputer built on traditional nodes, demonstrating the wafer-scale architecture’s innate hardware suitability for the extreme local real-time feedback demands of simulating strongly coupled physical systems.
3. Breaking the Reticle Limit: Scribe Line Stitching and Wafer-Scale Lithography
3.1 Physical Constraint: The Reticle Limit
The core limitation in semiconductor lithography comes from the Field of View of the optical lens. For current mainstream Deep Ultraviolet (DUV) and Extreme Ultraviolet (EUV) lithography machines, the maximum pattern area printable in a single exposure is limited by the physical size of the reticle (photomask) — typically 26 mm x 33 mm, approximately 858 mm². This means that, regardless of design, the physical area of any single-exposure die cannot exceed this limit. Traditional chip manufacturers use a Step-and-Repeat process to expose the same pattern multiple times across the wafer, subsequently mechanically cutting along scribe lines to divide the wafer into dozens or hundreds of individual chips.
3.2 Reticle Stitching Process Details
TSMC executes a standard step-and-repeat lithography process on a 300 mm wafer, printing a total of 84 identical dies arranged in a 12 x 7 grid. Each die stays within the ~858 mm² reticle limit; dividing the 46,225 mm² wafer area by 84 gives roughly 550 mm² per die. Unlike traditional processes, after standard exposure is complete, Cerebras adds extra lithography steps to fabricate miniature metal wires spanning the scribe line regions. These wires, less than 1 mm in length and running in the mid-to-upper levels of the on-chip metallization stack, physically connect the on-chip interconnect network (2D Mesh Fabric) of all 84 dies into a single, continuous plane.
This cross-die interconnect system comprises over 1 million wires. The protocol stack layer includes built-in redundancy mechanisms for defective wires (spare wires + automatic rerouting). From the compiler and software perspective, the boundaries of these 84 dies do not exist — the entire wafer presents as a unified, continuous 2D Mesh compute plane.
Technical Cost: Reticle stitching increases the number of photomasks and production steps, making the wafer manufacturing cost higher than that of a standard GPU wafer. However, Cerebras’s argument is that this additional cost is offset by the elimination of off-chip packaging, inter-chip connectivity, and system integration costs enabled by wafer-scale integration.
3.3 Core Micro-Design: The Physical Significance of a 0.05 mm² Core
A single AI-optimized core in the WSE-3 occupies an area of 0.05 mm² — approximately 1/120th the area of a Streaming Multiprocessor (SM, ~6 mm²) in an NVIDIA H100. This extremely small core size has multiple physical implications:
- Minimized Defect Cost: At a given defect density, a single defect renders only 0.05 mm² of silicon area non-functional, instead of 6 mm² — a 120x reduction in the economic cost of defects.
- Fine-Grained Redundancy: Within a fixed silicon area, a far greater number of physical cores than functionally required can be integrated, providing ample redundant spares.
- Short Interconnect Latency: The physical distance between cores is on the order of tens of micrometers, with signal propagation delay being just 1 clock cycle.
Within this tiny 0.05 mm² space, the silicon area allocation is roughly: about 50% for a 48 KB single-cycle SRAM, and the remaining 50% for general-purpose tensor and sparse algebra computation logic composed of approximately 110,000 standard gates. A single core’s peak power consumption at a frequency of 1.1 GHz is only 30 mW.
4. Defect Tolerance and Yield
4.1 The Physical Reality of Defect Density
The typical defect density for TSMC’s 5nm process is approximately 0.001 defects per mm² (data for mature nodes). On the 46,225 mm² WSE-3, this density translates to roughly 46 random physical defects per wafer. In traditional chip manufacturing, any single one of these defects falling within the active area of a chip renders the entire die non-functional — this is the fundamental reason why, for 75 years, chips have been made smaller, not larger.
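The defect count above is simply density times area. The sketch below reproduces that arithmetic and adds the textbook Poisson yield model Y = exp(-D·A) to show why a monolithic die of this size would essentially never come out defect-free. The Poisson model and the function names are assumptions introduced here for illustration; the source does not state which yield model applies.

```python
import math

DEFECT_DENSITY = 0.001        # defects per mm^2, figure quoted in the text
WAFER_AREA_MM2 = 46_225       # WSE-3 silicon area
RETICLE_AREA_MM2 = 26 * 33    # 858 mm^2 single-exposure limit


def expected_defects(density: float, area_mm2: float) -> float:
    """Mean number of random defects landing on a piece of silicon."""
    return density * area_mm2


def poisson_yield(density: float, area_mm2: float) -> float:
    """Probability of zero defects under a Poisson model (assumed here)."""
    return math.exp(-expected_defects(density, area_mm2))


if __name__ == "__main__":
    wafer = expected_defects(DEFECT_DENSITY, WAFER_AREA_MM2)
    print(f"expected defects per wafer: {wafer:.1f}")
    print(f"zero-defect yield, reticle-sized die: "
          f"{poisson_yield(DEFECT_DENSITY, RETICLE_AREA_MM2):.3f}")
    print(f"zero-defect yield, whole wafer: "
          f"{poisson_yield(DEFECT_DENSITY, WAFER_AREA_MM2):.2e}")
```

Under this model a reticle-sized die still has a usable chance of being defect-free, while a defect-free wafer is vanishingly unlikely, which is why tolerance rather than avoidance is the only workable strategy at wafer scale.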
4.2 Cerebras’s Solution: A Three-Layer Mechanism for 100x Defect Tolerance
Layer 1: Core Miniaturization. A single core area of 0.05 mm² versus the ~6 mm² of an H100 SM creates an asymmetry in defect cost. At the identical defect density of 0.001 defects/mm², one defect on the WSE-3 has a 50% probability of landing inside a core area, for an expected loss of 0.025 mm² of silicon; one defect on the H100 has a 99.8% probability of landing inside an SM area, for an expected loss of approximately 3 mm² of silicon. Extrapolating from these figures, the expected silicon loss per defect is roughly 120 times smaller on the WSE-3 (3 mm² / 0.025 mm²), in line with the core-area ratio.
Layer 2: Physical Redundancy. WSE-3 physically integrates 970,000 cores on the wafer but nominally enables only 900,000. The 70,000 extra cores (approximately 7.2% physical redundancy) provide ample spare capacity.
Layer 3: Fail-in-Place Resilient Routing. During the chip’s power-on initialization phase, test logic identifies the locations of all defective cores. The on-chip reconfigurable interconnect network then automatically bypasses the failed cores, remapping neighboring healthy cores to the corresponding positions in the logical grid. This process is completed entirely automatically in hardware and is transparent to the software layer.
The net effect of this three-layer mechanism: The effective active silicon area ratio of the WSE-3 reaches approximately 93% (900,000 / 970,000), achieving a usable yield at commercial scale comparable to diced chip processes. Cerebras’s core insight is that solving the yield problem does not depend on reducing defects, but on making the economic cost of each defect approach zero.
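The redundancy and fail-in-place ideas in this section can be illustrated with a toy one-dimensional remapping. This is a simplified sketch and not Cerebras's routing hardware: `remap_row`, `redundancy_fraction`, and the row-with-spares layout are hypothetical, chosen only to show how logical positions can skip defective physical cores.

```python
def remap_row(core_ok: list[bool], logical_width: int) -> list[int]:
    """Map logical core positions to healthy physical cores in one row.

    core_ok[i] is True when physical core i passed the power-on test.
    Returns the physical index backing each logical position, skipping
    defective cores. Raises if the spares cannot cover the defects.
    """
    healthy = [i for i, ok in enumerate(core_ok) if ok]
    if len(healthy) < logical_width:
        raise ValueError("not enough healthy cores to fill the logical row")
    return healthy[:logical_width]


def redundancy_fraction(physical: int, enabled: int) -> float:
    """Share of physical cores held back as spares."""
    return (physical - enabled) / physical


if __name__ == "__main__":
    # Ten physical cores back eight logical ones; cores 2 and 5 are defective.
    row = [True, True, False, True, True, False, True, True, True, True]
    print(remap_row(row, 8))
    print(f"spare share: {redundancy_fraction(970_000, 900_000):.1%}")
```

Software addresses only the logical positions, so the defective cores never appear in its view of the grid.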
5. Micro-Core Architecture and On-Chip Dataflow Network
5.1 Internal Structure of a Compute Core
Each WSE core internally contains:
- 48 KB single-cycle SRAM, using an 8-Bank split architecture (6 KB per Bank, 32-bit width), supporting simultaneous conflict-free access of 2x 64-bit reads + 1x 64-bit write per clock cycle.
- 256 Bytes of software-managed cache, specifically for storing high-frequency changing data structures like accumulators.
- Compute logic composed of 110,000 standard gates, supporting tensor multiply-accumulate and sparse matrix operations.
- Native sparse triggering in the instruction set: upon detecting an input weight is zero, it automatically skips the multiply-accumulate operation, yielding a several-fold effective speedup when processing highly sparse large language models.
5.2 On-Chip Interconnect Network Architecture
The WSE builds a high-speed interconnect network based on a 2D Mesh topology. Each core integrates a 5-port structural router (East, West, South, North, Local), supporting bidirectional 32-bit single-cycle data transfer. Each physical transmission packet consists of 16-bit compute data + 16-bit index data, perfectly matching the coordinate addressing requirements of sparse matrix computation.
Network communication is divided at the hardware level into 24 independently configurable static routing colors (Colors). Each color has hardware-isolated dedicated buffer queues, sharing the physical bus via Time-Multiplexing for non-blocking transmission. The on-chip Fabric natively supports hardware-level single-cycle Broadcast and Multicast. Since physical wires between cores are only tens of micrometers, cross-core signal latency is just 1 clock cycle (~0.9 ns at 1.1 GHz).
Fundamental Architectural Difference from GPU: The WSE uses a Dataflow Architecture — computation is driven by the arrival of data. 32-bit wavelet messages travel through the 2D grid; the wavelet’s 5-bit color tag determines the routing path and trigger task. When a wavelet arrives at a specific color channel, the bound task is launched for execution. If the weight is zero, no wavelet is emitted, achieving unstructured sparsity acceleration. In contrast, GPUs use a control flow architecture (SIMT/Warp) — execution is driven by the program counter; all 32 threads in the same warp execute the same instruction and cannot skip zero-value computations.
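The data-driven execution described above can be sketched in a few lines. This is a conceptual model only: the `Wavelet` and `Core` classes, the color id, and the use of Python floats in place of 16-bit payloads are all assumptions made for illustration and do not reflect the real instruction set or CSL API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Wavelet:
    color: int    # routing tag; the text describes a 5-bit field
    index: int    # position of the value within the sparse operand
    value: float  # payload (16-bit on hardware; a float here for simplicity)


class Core:
    """Toy core: tasks are bound to colors and run only when data arrives."""

    def __init__(self) -> None:
        self.tasks: dict[int, Callable[[Wavelet], None]] = {}
        self.acc = 0.0   # accumulator
        self.macs = 0    # multiply-accumulates actually executed

    def bind(self, color: int, task: Callable[[Wavelet], None]) -> None:
        self.tasks[color] = task

    def receive(self, w: Wavelet) -> None:
        self.tasks[w.color](w)  # arrival of the wavelet triggers the task


def send_sparse(weights: list[float], color: int, dest: Core) -> int:
    """Emit one wavelet per nonzero weight; zeros produce no traffic."""
    sent = 0
    for i, wv in enumerate(weights):
        if wv != 0.0:
            dest.receive(Wavelet(color, i, wv))
            sent += 1
    return sent


def run_dot(weights: list[float], activations: list[float]) -> tuple[float, int]:
    """Dot product where only nonzero weights cause any work."""
    core = Core()
    mac_color = 3  # arbitrary color id chosen for this example

    def mac(w: Wavelet) -> None:
        core.acc += w.value * activations[w.index]
        core.macs += 1

    core.bind(mac_color, mac)
    send_sparse(weights, mac_color, core)
    return core.acc, core.macs
```

Calling `run_dot([0.0, 2.0, 0.0, 0.5], [1.0, 3.0, 5.0, 4.0])` performs two multiply-accumulates instead of four, because the two zero weights never generate a wavelet.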
5.3 On-Chip Memory Hierarchy
| Level | Medium | Capacity | Aggregate Bandwidth | Latency | Physical Location |
|---|---|---|---|---|---|
| L0 - Register | Core-Private | 256 B | 5.3 TB/s (Single Core Peak) | 1 cycle | Inside Core |
| L1 - SRAM | Core-Private | 48 KB x 900,000 = 44 GB | 21 PB/s (Full Chip Aggregate) | 1 cycle | Inside Core |
| L2 - MemoryX | DRAM + Flash | 1.5 TB ~ 1.2 PB | Proprietary Protocol | High | External Cabinet |
| L3 - SwarmX | Switch Network | Cluster-Level | Broadcast/Reduce Hardware Acceleration | Topology-Dependent | Cluster Interconnect |
The 21 PB/s aggregate bandwidth of the 44 GB on-chip SRAM in the WSE-3 fundamentally changes the compute economics. Compared to the 3 TB/s of H100’s HBM3, the difference is a factor of 7,000. More critically, SRAM bandwidth scales linearly with capacity (each bank can be read in parallel by adjacent compute units), whereas HBM bandwidth is limited by the number of physical channels — this is an architectural difference, not one that can be bridged by process advancements alone.
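As a rough cross-check of the bandwidth figures, the sketch below computes the SRAM-to-HBM ratio and the average number of SRAM bytes each core moves per clock cycle. It assumes the 1.1 GHz clock quoted for a single core applies chip-wide and that the aggregate figure is spread evenly over the 900,000 enabled cores; both are simplifications, and the function names are illustrative.

```python
SRAM_BW_BYTES_S = 21e15   # 21 PB/s aggregate, from the table
HBM3_BW_BYTES_S = 3e12    # 3 TB/s, the H100 figure used in the text
CORES = 900_000           # enabled cores
CLOCK_HZ = 1.1e9          # clock quoted for a single core


def bandwidth_ratio(sram: float = SRAM_BW_BYTES_S, hbm: float = HBM3_BW_BYTES_S) -> float:
    """How many times larger the on-chip SRAM bandwidth is than HBM3."""
    return sram / hbm


def per_core_bytes_per_cycle(
    sram: float = SRAM_BW_BYTES_S, cores: int = CORES, clock_hz: float = CLOCK_HZ
) -> float:
    """Average SRAM bytes each core moves per clock cycle."""
    return sram / cores / clock_hz


if __name__ == "__main__":
    print(f"SRAM / HBM3 bandwidth ratio: {bandwidth_ratio():,.0f}x")
    print(f"bytes per core per cycle: {per_core_bytes_per_cycle():.1f}")
```

The per-core result comes out a little under the 24 bytes per cycle implied by two 64-bit reads plus one 64-bit write, so the aggregate figure is consistent with the per-core SRAM port description.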
6. Power Delivery, Thermal Management, and Thermal Expansion Mechanical Compensation Engineering
6.1 Power Delivery Challenge: Voltage Consistency at 23 kW
The WSE-3 has a full-load rated power consumption of 23 kW at an operating voltage of approximately 0.8-0.9 V (sub-volt level), which requires a continuous current of roughly 25,600 to 28,750 A (23 kW / 0.9 V to 23 kW / 0.8 V). In a traditional horizontal two-dimensional power delivery architecture, power is routed from the chip edge through lateral bus bars on the PCB. Due to the physical impedance of metal wiring, a current approaching 30 kA crossing a 215 mm wafer would generate a catastrophic IR drop: theoretically, the voltage drop from edge to center could reach 9.6 V, more than ten times the sub-1 V operating voltage, making it impossible to power cores in the central region.
Solution: 3D Vertical Power Delivery. Cerebras placed a custom multi-layer high-density power distribution PCB directly behind the wafer, embedding over 300 high-frequency step-down Voltage Regulator Modules (VRMs). Current is projected perpendicularly to the wafer surface directly onto micro-electrical contacts on the back of each core, over a physical distance of only a few millimeters. Each of the 84 die areas has its voltage regulated independently, completely eliminating lateral IR Drop. The entire power delivery network is encapsulated within a four-layer physical sandwich called the Engine Block: Cold Plate, Wafer, Custom Compliant Connector, Power PCB.
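The current and IR-drop figures in this subsection follow from I = P/V and V = I·R. The sketch below works through them; the effective lateral resistance is back-computed from the 9.6 V and 30 kA figures in the text and is not a measured value, and the function names are illustrative.

```python
POWER_W = 23_000.0  # full-load rating from the text


def supply_current(power_w: float, voltage_v: float) -> float:
    """I = P / V."""
    return power_w / voltage_v


def implied_resistance(drop_v: float, current_a: float) -> float:
    """R = V / I: lateral path resistance that would produce a given IR drop."""
    return drop_v / current_a


if __name__ == "__main__":
    for v in (0.8, 0.9):
        print(f"{v:.1f} V supply -> {supply_current(POWER_W, v):,.0f} A")
    r = implied_resistance(9.6, 30_000.0)
    print(f"a 9.6 V drop at 30 kA implies about {r * 1e3:.2f} milliohm")
```

A fraction of a milliohm of lateral resistance is enough to lose many times the supply voltage at these currents, which is the motivation for delivering power vertically over a few millimeters.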
6.2 Coefficient of Thermal Expansion Mismatch and Mechanical Compensation
The core packaging challenge faced by the system originates from the Coefficient of Thermal Expansion (CTE) mismatch of heterogeneous materials:
| Material | CTE (ppm/°C) | Expansion across the 215 mm Wafer Edge under a 65 °C Rise |
|---|---|---|
| Silicon | 2.6 | ~36 µm |
| FR-4 PCB | 15 (Lateral) | ~210 µm |
| Copper (Cold Plate) | 17 | ~238 µm |
The expansion of the PCB under a 65 °C temperature rise is roughly 5.8 times that of silicon. Across the full 215 mm edge, the PCB outgrows the silicon by about 174 µm; measured from the wafer center to a corner (a half-diagonal of about 152 mm), the relative displacement is about 122 µm. For traditional packaging (BGA, flip-chip, wire bonding), this already exceeds the failure threshold by a factor of 5-7.
Solution: Co-founder Jean-Philippe Fricker led the design of a custom Compliant Elastomeric Connector. This connector, sandwiched between the wafer and the PCB, maintains good electrical conductivity in the vertical direction while possessing high physical compliance and deformation resilience in the horizontal shear direction. When temperature differences cause the PCB to expand more than the silicon, the elastic connector layer absorbs all shear stress through microscopic physical shear deformation, ensuring contact reliability for hundreds of thousands of power and signal pins.
Additionally, a dynamic Ambulating Thermal Interface (ATI) material is embedded between the cold plate and the back of the wafer. Composed of a high thermal conductivity material and a physical friction-reducing material laminated together, it allows the water-cooled copper plate to undergo micron-level lossless horizontal sliding against the silicon surface during thermal deformation, preventing stress transfer that could physically fracture the silicon wafer.
No off-the-shelf automated equipment could precisely handle such a large-area, fragile, heterogeneous 3D layered assembly. Cerebras designed and built dedicated high-precision alignment and pressure assembly machines from scratch to close the mechanical structure loop for wafer-scale components.
6.3 Liquid Cooling System
The system employs Direct-to-Chip Liquid Cooling. Dual-redundant industrial high-pressure pumps inject cooling water at 20 ± 2 °C, at a flow rate of 100 ± 10 L/min, into a brass manifold cold plate conforming to the wafer surface. The interior of the cold plate is machined with micro-fin channels to maximize heat exchange surface area.
At the data center level, row-based and in-rack high-precision fluid manifold controls are deployed, with digital monitoring of flow and pressure to eliminate stagnant zones. The CSoft software layer runs dynamic duty-cycle dummy operations (Power Ramp Smoothing via Dummy Operations) when no computational workload is active, smoothing power transients to stay within electrical safety boundaries and preventing potential physical damage to the wafer-scale system from instantaneous, drastic voltage fluctuations.
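The idea of smoothing power transients with dummy operations can be shown with a simple ramp limiter. This is purely illustrative: the source does not describe the actual CSoft algorithm, so the function name, the per-step limit, and the choice to handle only ramp-down are assumptions.

```python
def smooth_ramp_down(load_kw: list[float], max_drop_kw: float) -> list[float]:
    """Return the dummy power to add at each step so that total draw never
    falls by more than max_drop_kw between consecutive steps."""
    dummy: list[float] = []
    prev_total = None
    for load in load_kw:
        if prev_total is None:
            total = load
        else:
            # Hold the total up if the real workload drops too quickly.
            total = max(load, prev_total - max_drop_kw)
        dummy.append(total - load)
        prev_total = total
    return dummy


if __name__ == "__main__":
    trace = [20.0, 20.0, 2.0, 2.0, 2.0]  # workload ends abruptly at step 2
    print(smooth_ramp_down(trace, max_drop_kw=6.0))
```

In this trace the workload falls from 20 kW to 2 kW in one step; the limiter fills the gap with dummy work so the total steps down gradually instead.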
7. CSoft Compilation Pipeline and Execution Modes
7.1 Compiler Hierarchy
The core of the Cerebras CSoft software platform is the Cerebras Graph Compiler (CGC), responsible for losslessly mapping a PyTorch/TensorFlow computational graph onto the physical grid of 900,000 cores. The compilation pipeline follows a stepwise Lowering logic:
- PyTorch Model
- Lazy Tensor Backend (ATen Operator Graph Capture)
- XLA HLO (High-Level Optimization Custom Calls)
- CIRH (Cerebras IR High, an MLIR dialect, full-graph level rewrite passes)
- Operator Deep Fusion, Constant Folding, Common Subexpression Elimination, Dead Code Pruning
- Pattern Matching against a Pre-built Library of Hand-Tuned High-Performance Kernels
- Match Success: Directly Generate Optimized Instructions
- Match Failure: CLAIR/LAIR (Low-Level Linear Algebra IR)
- AutoGen Automated Kernel Compiler
- Polyhedral Space Transformation Mathematical Optimization
- 2D Core Topology-Aware Placement
- 8-Bank SRAM Allocation
- Final Physical Instruction Machine Code
The AutoGen kernel compiler supports four strategies: default, disabled, medium, and aggressive, providing adaptive kernel generality for customized development.
Key Constraint: The computational graph must be a Static Graph — dynamic shapes or data-dependent branching are not supported. The routing table is fixed once at compile time for all 900,000 cores.
7.2 Two Execution Modes
Layer-Pipelined Mode - Model Fully Resident On-Chip:
flowchart LR
Input[Input Data Stream] --> L1["WSE Partition 1<br/>Layer 1"]
L1 --> L2["WSE Partition 2<br/>Layer 2"]
L2 --> Ldots[...]
Ldots --> LN["WSE Partition N<br/>Layer N"]
LN --> Output[Output]
Characteristics: Model parameters are fully resident on-chip. Multiple micro-batches run concurrently, spatially interleaved across the wafer. The compiler must solve a VLSI floorplanning problem — one Cerebras proved to be NP-hard at ISPD 2020. Applicable for models whose parameters fit within the 44 GB SRAM.
Weight Streaming Mode - Current Default:
flowchart LR
subgraph MemoryX[External MemoryX Storage]
W1[Layer 1 Weights]
W2[Layer 2 Weights]
WN[Layer N Weights]
end
W1 --> WS["WSE Full Chip<br/>900,000 Cores<br/>Single Layer Computation"]
W2 --> WS
WN --> WS
WS --> Grad[Gradient Return]
Grad --> MemoryX
Characteristics: The entire wafer processes a single layer at a time. Weights reside in external MemoryX (DRAM + Flash, up to 1.2 PB) and are streamed to the wafer layer-by-layer. All 900,000 cores process the same layer; activations remain on-chip, and weights are discarded after computation. Scaling to 2,048 systems requires only changing a single flag.
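The two execution modes can be contrasted with a small sketch. The mode check below compares only raw weight bytes against the 44 GB of SRAM and ignores activations and optimizer state, and the streaming loop uses a Python generator as a stand-in for MemoryX. All names here are hypothetical and none of this is the CSoft API.

```python
from typing import Iterator

SRAM_BYTES = 44e9  # on-chip SRAM capacity from the text

Matrix = list[list[float]]


def choose_mode(num_params: float, bytes_per_param: int = 2) -> str:
    """Pick an execution mode by checking whether the weights fit on-chip."""
    fits = num_params * bytes_per_param <= SRAM_BYTES
    return "layer-pipelined" if fits else "weight-streaming"


def stream_layers(store: list[Matrix]) -> Iterator[Matrix]:
    """Stand-in for MemoryX: yields one layer's weight matrix at a time."""
    for weights in store:
        yield weights


def matvec(weights: Matrix, x: list[float]) -> list[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def forward(store: list[Matrix], x: list[float]) -> list[float]:
    """Weight-streaming forward pass: activations stay resident while each
    layer's weights are used once and then dropped."""
    for weights in stream_layers(store):
        x = matvec(weights, x)
    return x


if __name__ == "__main__":
    print(choose_mode(13e9))    # 26 GB of FP16 weights
    print(choose_mode(400e9))   # 800 GB of FP16 weights
    layers = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 1.0]]]
    print(forward(layers, [3.0, 4.0]))
```

Only one layer's weights are held at any moment, so the resident weight footprint is bounded by the largest layer rather than by the whole model.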
7.3 The Order-of-Magnitude Difference in Software Complexity
Training a 175 billion parameter large model on a GPU cluster typically requires ~20,000 lines of distributed training code (a combination of Tensor Parallelism + Pipeline Parallelism + FSDP + DeepSpeed + Megatron-LM). Cerebras claims an equivalent scale requires just 565 lines of PyTorch code, and that the software complexity of training a 1 trillion parameter model is comparable to training a 1 billion parameter model on GPUs — a frequently underestimated competitive advantage.
7.4 Developer Interfaces
- AI Inference Users: OpenAI-compatible API, zero learning curve.
- Model Training Users: Standard PyTorch / TensorFlow frameworks; CSoft handles the low-level details.
- HPC Developers: CSL SDK, a Zig-based DSL that allows programming individual cores, manually configuring routing tables, and fitting code and data into the 48 KB memory. There are no thread concepts, no shared memory, and no kernel launches, but also no need to handle synchronization or race conditions.
8. Wafer-Scale Supercomputer Cluster Architecture
8.1 Cluster Components
| Component | Function | Specification |
|---|---|---|
| CS-3 | Single System Compute Unit | 15U, 23 kW, 125 PFLOPS |
| MemoryX | External Weight Storage Node | 1.5 TB ~ 1.2 PB |
| SwarmX | Cluster Switching Network | Hardware-Level Broadcast + Gradient Reduce/Sum |
| CSL | Cluster Interconnect Topology | Up to 2,048 Nodes, 256 ExaFLOPs |
| AI400X2 | Parallel File Storage | 90+ GB/s Sustained Bandwidth, 3M+ IOPS |
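The cluster-level peak in the table is a straight multiplication of the per-system figure. The sketch below reproduces it and adds a rough power estimate; the power number counts only the CS-3 chassis and leaves out MemoryX, SwarmX, storage, and facility cooling, so it is a lower bound under that stated assumption.

```python
CS3_PFLOPS = 125.0   # peak per CS-3, from the table
CS3_POWER_KW = 23.0  # rated power per CS-3
MAX_NODES = 2_048


def cluster_peak_exaflops(nodes: int, per_node_pflops: float = CS3_PFLOPS) -> float:
    """Aggregate peak compute in ExaFLOPs (1 ExaFLOP = 1,000 PFLOPS)."""
    return nodes * per_node_pflops / 1_000


def cluster_power_mw(nodes: int, per_node_kw: float = CS3_POWER_KW) -> float:
    """Aggregate CS-3 chassis power in megawatts, excluding everything else."""
    return nodes * per_node_kw / 1_000


if __name__ == "__main__":
    print(f"{cluster_peak_exaflops(MAX_NODES):.0f} ExaFLOPs peak")
    print(f"{cluster_power_mw(MAX_NODES):.1f} MW for the CS-3 systems alone")
```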
8.2 DARPA Co-Packaged Optics Project
Under DARPA funding, Cerebras is collaborating with Ranovus to develop wafer-scale Co-Packaged Optics (CPO). The goal is to mount Ranovus’s optical fiber transceivers directly onto the edge of the wafer, replacing traditional electrical off-chip interconnects with a multi-wavelength, multi-mode fiber optic network. The project targets over 100x the data throughput capacity of conventional CPO solutions while drastically reducing the power consumption of cluster-level parameter transfer.
9. Wafer-Scale Architecture vs. NVIDIA GPU Clusters
9.1 System-Level Physical Specification Comparison
| Evaluation Dimension | Cerebras CS-3 | NVIDIA B200 (Single Card) | NVIDIA DGX B200 (8-Card Node) | NVIDIA GB200 NVL72 (Full Rack) |
|---|---|---|---|---|
| Core Physical Form | Single Wafer Integrated (15U Chassis) | Dual-Die Bridged (SXM Module) | 8-Card Parallel (10U Chassis) | 72-Card High-Density Parallel (Full Rack) |
| FP16 Peak Compute | 125 PFLOPS | 4.4 PFLOPS | 36 PFLOPS | 360 PFLOPS |
| On-Board Memory Capacity | 44 GB SRAM (On-Chip) | 192 GB HBM3e | 1.5 TB HBM3e | 13.5 TB HBM3e |
| Memory Access Bandwidth | 21,000 TB/s (21 PB/s) | 8.0 TB/s | 64 TB/s | 576 TB/s |
| Inter-Chip Communication Performance | On-Chip Metal Traces, Zero External Loss | SXM Physical Socket | NVSwitch On-Board Routing | 9 Sets High-Speed Copper + Optical Switch |
| Max Rated Power Consumption | ~23 kW | ~1,200 W | ~14.3 kW | ~120 kW |
| Rack Space | 15U | Single Slot | 10U | 42U |
| LLM Training Programming Complexity | Pure Data Parallelism, ~565 Lines of Code | Not Applicable | Tensor/Pipeline/FSDP Combination | Extremely Complex Network Topology Configuration |
9.2 Training Deployment: Parameter Scale vs. Physical Requirements
| Model Parameter Scale | GPU Cluster Requirement (B200) | Cerebras CS-3 Requirement |
|---|---|---|
| 100 Billion (100B) | >=12 B200 GPUs + NVSwitch + InfiniBand | 1x CS-3 + 2.4 TB MemoryX |
| 1 Trillion (1T) | Hundreds of B200 GPUs + Fiber Interconnect Cluster | 1x CS-3 + 1.2 PB MemoryX |
| 10 Trillion (10T) | 1,000+ B200 Servers | 1x CS-3 + 1.2 PB MemoryX |
9.3 Inference: Roofline Model Analysis
The arithmetic intensity formula for the LLM token generation (decode) phase is: Arithmetic Intensity = FLOPs / Bytes, approximately 1 FLOP/byte. The Ridge Point = Peak FLOPS / Peak Memory Bandwidth.
| Chip | FP16 Peak | Memory Bandwidth | Ridge Point | Decode Arithmetic Intensity | State |
|---|---|---|---|---|---|
| H100 | 989 TFLOPS | 3.35 TB/s | 295 FLOP/byte | 1 FLOP/byte | 99.7% Compute Units Idle |
| B200 | 4.4 PFLOPS | 8 TB/s | 550 FLOP/byte | 1 FLOP/byte | >99.8% Idle |
| WSE-3 | 12.5 PFLOPS | 21 PB/s | 0.6 FLOP/byte | 1 FLOP/byte | Compute Bound |
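According to the table, the WSE-3 is the only chip that is compute-bound during the decode phase at a batch size of 1 (Batch-1). A single user request can therefore achieve full hardware utilization, without the large batch sizes that GPUs rely on to amortize the overhead of reading weights from HBM. This conclusion depends on the 12.5 PFLOPS peak used in the table: with the 125 PFLOPS FP16 peak quoted for the WSE-3 elsewhere in this document, the ridge point would be about 6 FLOP/byte, so decode at roughly 1 FLOP/byte would fall below it, although it would still be far closer to balance than on the H100 or B200.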
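The ridge points and idle fractions in the table can be reproduced with the standard roofline relations. The sketch below uses the table's own peak and bandwidth values; the function names are illustrative, and the utilization bound is the idealized roofline limit rather than a measured figure.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory balance."""
    return peak_flops / bandwidth_bytes_s


def utilization_bound(arith_intensity: float, ridge: float) -> float:
    """Roofline upper bound on the usable fraction of peak FLOPS."""
    return min(1.0, arith_intensity / ridge)


# (peak FLOPS, memory bandwidth in bytes/s), as listed in the table
CHIPS = {
    "H100": (989e12, 3.35e12),
    "B200": (4.4e15, 8e12),
    "WSE-3": (12.5e15, 21e15),
}
DECODE_INTENSITY = 1.0  # FLOP/byte, the decode estimate used in the text

if __name__ == "__main__":
    for name, (peak, bw) in CHIPS.items():
        ridge = ridge_point(peak, bw)
        idle = 1.0 - utilization_bound(DECODE_INTENSITY, ridge)
        print(f"{name}: ridge {ridge:.1f} FLOP/byte, compute idle {idle:.1%}")
```

A workload whose arithmetic intensity sits below the ridge point is limited by memory bandwidth, which is why the two GPU rows show almost all compute units idle at batch size 1.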
Measured Inference Performance Comparison (Independent Benchmarks):
| Model | Cerebras CS-3 | NVIDIA DGX B200 | Advantage Factor |
|---|---|---|---|
| Llama 4 Maverick (400B) | 2,500+ tok/s/user [7][12] | ~1,000 tok/s/user | 2.5x |
| gpt-oss-120B (10 Concurrency) | 2,700+ tok/s | 580 tok/s | 4.7x |
| DeepSeek R1 70B | 1,600 tok/s | - | - |
| Perplexity Sonar | 1,200 tok/s | - | - |
9.4 Inference TCO Comparison
Cerebras CS-3 pricing is approximately $2-3 million per node. DGX B200 (8x B200) is around $300,000. However, Cerebras claims that under an equivalent online token generation load, the CS-3’s comprehensive TCO (hardware CapEx + power OpEx) is approximately 32% lower than DGX B200, while delivering 21x faster single-token interaction speed. The core of this pricing logic is that a single CS-3 is equivalent in inference throughput to multiple DGX B200s and eliminates the significant hidden costs of InfiniBand networks, multi-rack space, and cluster management software.
10. I/O Bottleneck
10.1 I/O Bottleneck: A 133,000x Gap
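On-Chip SRAM Bandwidth: 21 PB/s. Off-Chip MemoryX/Network I/O: approximately 150-200 GB/s. Gap: roughly 105,000x to 140,000x across that range; the 133,000x headline figure corresponds to about 158 GB/s of off-chip bandwidth.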
For models that fit within the 44 GB SRAM, this is not an issue, because all data circulates on-chip. However, mainstream large language models are rapidly surpassing this capacity (Llama 4 Maverick 400B requires 800 GB of weights in FP16). In Weight Streaming mode, the rate at which each layer’s weights arrive from MemoryX is capped by the off-chip bandwidth, so I/O becomes the limiting factor.
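To give a sense of scale, the sketch below computes the on-chip to off-chip bandwidth gap and the time needed to move one full copy of the weights across the off-chip link. It is a back-of-the-envelope bound that ignores overlap between transfer and compute, caching, and any quantization or sparsity; the function names are illustrative.

```python
ON_CHIP_BYTES_S = 21e15                 # 21 PB/s SRAM bandwidth
OFF_CHIP_RANGE_BYTES_S = (150e9, 200e9) # off-chip I/O range from the text


def bandwidth_gap(on_chip: float, off_chip: float) -> float:
    """Ratio between on-chip and off-chip bandwidth."""
    return on_chip / off_chip


def stream_time_s(model_bytes: float, off_chip: float) -> float:
    """Seconds to move the full set of weights across the off-chip link once."""
    return model_bytes / off_chip


if __name__ == "__main__":
    weights_bytes = 400e9 * 2  # 400B parameters in FP16, i.e. 800 GB
    for bw in OFF_CHIP_RANGE_BYTES_S:
        print(f"{bw / 1e9:.0f} GB/s: gap {bandwidth_gap(ON_CHIP_BYTES_S, bw):,.0f}x, "
              f"one weight pass {stream_time_s(weights_bytes, bw):.1f} s")
```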
The collaboration model with AWS — where Trainium handles Prefill and Cerebras handles Decode — is essentially an implicit acknowledgment of the chip’s insufficient efficiency during the compute-bound prefill stage.
10.2 Stagnation in SRAM Density Growth
The area of a high-density SRAM bitcell has essentially stalled at approximately 0.021 µm² from the 5nm node down to 3nm and even 2nm nodes. SRAM scaling faces physical limits (leakage and stability constraints of the 6T cell). Concurrently, HBM is progressing from HBM3 (~5 Gb/s/pin) towards HBM4 (8+ Gb/s/pin), with stack layers increasing from 12 to 16, and capacity heading towards 1+ TB by 2028. Cerebras’s SRAM capacity advantage will face a structural challenge within 2-3 generations.
10.3 The Undisclosed Throughput Crossover Point
Cerebras’s headline-grabbing tok/s figures are single-user speeds. GPUs boost aggregate throughput by batch processing users — serving multiple concurrent requests from the same weight read: at a batch size of 10-20, a single H100’s aggregate throughput may already match a single CS-3; at a batch size of 128+, a DGX H100 system can generate thousands of aggregate tok/s at significantly lower hardware cost. Cerebras has never published its aggregate throughput figures under high-concurrency scenarios. This is the most significant missing data point in publicly available materials.
11. Commercialization and Capital Landscape
11.1 Revenue Growth Trajectory
| Year | Revenue | YoY Growth | Key Drivers |
|---|---|---|---|
| 2022 | $24.6 Million | - | Early deployments for life sciences customers |
| 2023 | $78.7 Million | 220% | CS-2 shipments began |
| 2024 | $290.3 Million | 269% | CG-1/2 delivery + G42 prepayment momentum |
| 2025 | $510 Million | 76% | Large MBZUAI order + Inference cloud launch |
Quarterly trend acceleration is notable: Q1 2025 was $99.5 Million, and Q4 2025 reached $171.4 Million (an annualized run rate of ~$686 Million).
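The growth rates and the run rate above are simple ratios. The sketch below recomputes them from the revenue figures in the table; the function names are illustrative, and the run rate assumes the common convention of multiplying the latest quarter by four.

```python
REVENUE_M = {2022: 24.6, 2023: 78.7, 2024: 290.3, 2025: 510.0}  # USD millions


def yoy_growth(revenue: dict[int, float]) -> dict[int, float]:
    """Year-over-year growth in percent for every year with a prior year."""
    years = sorted(revenue)
    return {
        year: (revenue[year] / revenue[prev] - 1.0) * 100.0
        for prev, year in zip(years, years[1:])
    }


def annualized_run_rate(quarter_revenue_m: float) -> float:
    """Latest quarter multiplied by four."""
    return quarter_revenue_m * 4


if __name__ == "__main__":
    for year, pct in yoy_growth(REVENUE_M).items():
        print(f"{year}: {pct:.0f}% YoY")
    print(f"Q4 2025 run rate: ${annualized_run_rate(171.4):.1f}M")
```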
11.2 Deconstructing GAAP Profit: Paper Gains vs. Substantive Losses
| Metric | 2024 | 2025 | Change |
|---|---|---|---|
| GAAP Net Income (Loss) | ($481.6 Million) | $237.8 Million | Turned Profitable |
| Includes: One-Time Non-Cash Gain from Extinguishment of Forward Contract Liabilities | - | $363.3 Million | Paper Adjustment |
| Stock-Based Compensation (SBC) | - | $49.8 Million | Real Expense (Non-Cash) |
| Non-GAAP Operating Net Loss | ($21.8 Million) | ($75.7 Million) | Loss Widened by 247% |
| Operating Cash Flow | $452 Million | ($10 Million) | Turned from Positive to Negative |
Key Insight: The $237.8 million GAAP net profit in 2025 is a paper profit, driven by a one-time non-cash gain resulting from the recapitalization of forward purchase contract liabilities with G42. Excluding this, the Non-GAAP operating loss widened from $21.8 million to $75.7 million, a 247% increase in losses. Operating cash flow turned from positive to negative because 2024 included $640.3 million in customer prepayments from G42 (recorded as operating cash inflows), whereas in 2025 new prepayments declined while the company delivered against capacity it had already sold.
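The decomposition can be checked with two lines of arithmetic. The sketch below strips the one-time gain out of GAAP net income and recomputes the percentage by which the Non-GAAP loss widened; it is a plain subtraction that ignores tax effects and any other reconciling items, and the function names are illustrative.

```python
def income_excluding_gain(gaap_net_income_m: float, one_time_gain_m: float) -> float:
    """GAAP net income with a one-time non-cash gain removed (USD millions)."""
    return gaap_net_income_m - one_time_gain_m


def loss_widening_pct(prev_loss_m: float, curr_loss_m: float) -> float:
    """Percentage increase in the size of a loss between two periods."""
    return (curr_loss_m - prev_loss_m) / prev_loss_m * 100.0


if __name__ == "__main__":
    print(f"2025 net income ex-gain: ${income_excluding_gain(237.8, 363.3):.1f}M")
    print(f"Non-GAAP loss widened by {loss_widening_pct(21.8, 75.7):.0f}%")
```

Without the forward-contract gain, the 2025 GAAP result would have been a loss of roughly $125.5 million under this simple view.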
11.3 Gross Margin Structure
| Business Line | 2024 | 2025 | Q1 2025 | Q2 2025 | Q3 2025 | Q4 2025 |
|---|---|---|---|---|---|---|
| Consolidated Gross Margin | 42% | 39% | - | - | - | - |
| Hardware Gross Margin | - | 43% | - | - | - | - |
| Cloud Services Gross Margin | - | 30% | 68% | 26% | 16% | 21% |
Cloud gross margin plummeted from 68% in Q1 to 16% in Q3, reflecting severe underutilization of capacity in newly built data centers. Cerebras is shifting from a high-margin hardware sales model to a lower-margin cloud services model (the model used in the OpenAI contract). How far cloud margins recover toward hardware margins during this structural shift will determine the company’s long-term earnings power.
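As a rough illustration, the 2025 margins in the table imply an approximate revenue mix between hardware and cloud. The sketch assumes there are only these two segments and that the consolidated margin is their revenue-weighted average. Because the inputs are rounded percentages, the result is only indicative.

```python
def implied_hardware_share(consolidated, hardware, cloud):
    """Solve consolidated = w * hardware + (1 - w) * cloud for the weight w."""
    return (consolidated - cloud) / (hardware - cloud)


# 2025 gross margins from the table above.
w = implied_hardware_share(consolidated=0.39, hardware=0.43, cloud=0.30)
print(f"Implied mix: hardware ~{w:.0%} of revenue, cloud ~{1 - w:.0%}")
```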
11.4 IPO Pricing History
| Date | Event | Price Range / Pricing | Notes |
|---|---|---|---|
| 2025.10 | Initial Confidential S-1 Filing | Terminated | Stalled due to CFIUS review |
| 2026.05.04 | Amended S-1 Public Filing | $115-$125 | 28 Million Shares |
| 2026.05.10 | First Upward Revision | $125-$135 | Oversubscription exceeded expectations |
| 2026.05.11 | Second Upward Revision | $150-$160 | 30 Million Shares |
| 2026.05.13 | Final Pricing | $185 | 20x Oversubscribed |
| 2026.05.14 | First Day Open | $350 | +89% vs. Pricing |
| 2026.05.14 | First Day Close | $311 | Fully Diluted Valuation ~$48.8 Billion |
The pre-IPO secondary market (Hiive) average trading price of $187.53 was nearly identical to the IPO price of $185, which indirectly confirms the market’s high expectations for Cerebras as a differentiated AI processor company.
11.5 Funding History
| Date | Round | Amount | Per Share | Implied Valuation |
|---|---|---|---|---|
| 2016.5 | Series A | $27 Million | - | - |
| 2018.11 | Series D | $88 Million | - | Unicorn Status |
| 2019.11 | Series E | $270 Million | - | $2.4 Billion |
| 2021.11 | Series F | $250 Million | - | Over $4.0 Billion |
| 2024.7-9 | Series F-1 | $85 Million | $14.66 | - |
| 2025.9 | Series G | $1.1 Billion | $36.23 | $8.1 Billion |
| 2026.1 | Series H | $1.0 Billion | $89.01 | - |
| 2026.5.14 | IPO | $5.55 Billion | $185 | ~$48.8 Billion (first-day close, fully diluted) |
The jump from Series H ($89.01) to the IPO price ($185) in just 4 months marks a 108% increase. The move from Series F-1 ($14.66) to the IPO price ($185) represents a 1,162% increase over roughly 22 months.
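The step-ups quoted above can be reproduced from the per-share prices in the funding table. The proceeds cross-check assumes the 30 million share count from the second upward revision carried through to final pricing. The source does not state this explicitly.

```python
ipo_price = 185.00

# Per-share prices from the funding table above.
priced_rounds = {"Series F-1": 14.66, "Series G": 36.23, "Series H": 89.01}

for name, price in priced_rounds.items():
    step_up = (ipo_price / price - 1) * 100
    print(f"{name}: ${price:.2f} -> ${ipo_price:.2f} = +{step_up:,.0f}%")

# Cross-check of gross proceeds (assumed share count, see note above).
shares_offered = 30_000_000
gross_proceeds = shares_offered * ipo_price
print(f"Gross proceeds: ${gross_proceeds / 1e9:.2f}B")
```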
11.6 Key Terms of the OpenAI Agreement
- Total Contract Value: Over $20 Billion (publicly stated as a $10 Billion base, expandable to 2 GW).
- Compute Scale: 750 MW (Base), up to 2,000 MW (Optional).
- Model: Pure cloud capacity subscription (not hardware sales).
- Operating Loan: OpenAI provides Cerebras with a $1 Billion working capital loan.
- Equity: OpenAI receives warrants to subscribe for Class N non-voting common stock, allowing it to hold up to 10% equity in Cerebras upon completion of certain compute deployment milestones.
- Exclusivity Clause: The contract restricts Cerebras from selling products to Anthropic.
11.7 AWS Bedrock Collaboration
CS-3 systems are integrated into the AWS Bedrock managed inference service as an underlying compute engine, with Trainium handling prefill and Cerebras handling decode. This partnership opens a compliant commercial pathway for Cerebras to reach hundreds of thousands of small and medium business customers and enterprise developers, and it offsets the company’s largest geographic customer concentration risk (reliance on a single UAE client) with a U.S.-based channel.
12. Customer Concentration, Geopolitical Risk, and Competitive Landscape
12.1 Customer Concentration Data
| Customer | 2024 Revenue Share | 2025 Revenue Share | 2025 Accounts Receivable Share | Nature |
|---|---|---|---|---|
| G42 | 85% | 24% | - | Abu Dhabi Tech Holding, Strategic Investor |
| MBZUAI | Not Material | 62% | 77.9% | UAE AI University, G42 Affiliate |
| Total | 85% | 86% | - | Two Customers Represent 86% of Total Revenue |
12.2 The Multifaceted Nature of the G42 Relationship
G42 is simultaneously Cerebras’s customer (made $640 Million in prepayments), supplier (compute collaboration), partner (Condor Galaxy series supercomputer co-operator), investor (participated in Series G/H + holds 3.5 million shares), and a related party (as defined by ASC 850). In 2024, G42 was granted a warrant to purchase 1,857,516 shares of Class N common stock at an exercise price of $0.01 per share.
12.3 Controversial Terms of the OpenAI Warrants
OpenAI received 33.4 million warrants at a nominal exercise price of $0.01 per share (virtually free), in extreme contrast with the IPO price of $185 and the Series H price of $89.01. The value transferred through these warrants ranges from roughly $3 Billion to $6 Billion depending on the market price, a highly unusual arrangement by historical standards.
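The $3 Billion to $6 Billion range corresponds to the warrants' intrinsic value at the Series H and IPO prices, as the following sketch shows. It treats all 33.4 million warrants as exercisable and ignores the deployment milestones, time value, and dilution, so it is a simplification rather than a valuation.

```python
def warrant_intrinsic_value(shares, market_price, exercise_price=0.01):
    """Intrinsic value: the gain from exercising at the strike price."""
    return shares * max(market_price - exercise_price, 0.0)


warrants = 33_400_000
for label, price in [("Series H", 89.01), ("IPO", 185.00)]:
    value = warrant_intrinsic_value(warrants, price)
    print(f"{label} (${price:.2f}): ${value / 1e9:.2f}B")
```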
12.4 Competitive Landscape: Cerebras vs. Groq vs. SambaNova
| Dimension | Cerebras | Groq | SambaNova |
|---|---|---|---|
| Core Idea | Wafer-Scale Integration (Giant Chip) | LPU (Language Processing Unit) | Reconfigurable Dataflow Architecture |
| Chip Area | 46,225 mm² | Standard Single Chip | Standard Single Chip |
| Process Node | TSMC 5nm | 14nm (first-generation LPU) | TSMC 5nm (SN50) |
| Memory Strategy | On-Chip SRAM 44 GB | SRAM (Deterministic Latency) | Hierarchical Reconfigurable |
| Core Selling Point | Memory Bandwidth / Dual-Use Training & Inference | Extremely Low & Deterministic Latency | Flexible Dataflow Programming |
| Funding/Valuation | IPO ~$48.8 Billion | ~$1 Billion+ | $4 Billion |
12.5 NVIDIA’s Structural Advantages
Although Cerebras holds a clear upper hand in inference latency and memory bandwidth, NVIDIA’s advantages in the following dimensions are difficult to shake in the short term:
- CUDA Ecosystem: 4 million+ developers, the broadest framework support, the most mature model optimization libraries.
- Workload Flexibility: Same hardware used for both training and inference, supporting any model architecture.
- Supply Chain Maturity: Global OEM system integrators, a spare parts market, and well-established enterprise operation and maintenance processes.
- Continuous Memory Capacity Growth: HBM3e moves to HBM4 in 2026, with 1+ TB projected for 2028, while SRAM density growth is stagnating.
- Community Effects: All new models debut on CUDA/cuDNN; Cerebras requires labor-intensive manual adaptation per model.
13. Summary and Outlook
Cerebras’s wafer-scale chip represents a unique achievement in the history of semiconductor engineering since the invention of the microprocessor. Gene Amdahl challenged WSI in 1980 with Trilogy Systems and ended in devastating failure. Over forty years later, the same problem has been given engineering-grade solutions across five critical dimensions — defect tolerance, reticle stitching, thermal expansion compensation, vertical power delivery, direct liquid cooling — making WSI a commercial reality for the first time.
From a technical perspective, the WSE-3’s 21 PB/s on-chip SRAM bandwidth lowers the Ridge Point for the LLM decode phase to 0.6 FLOP/byte, making a batch size of 1 compute-bound, which is impossible on GPUs. This architectural characteristic arrives at a propitious time, the 2025-2026 rise of the Inference Economy: as inference surpasses training to become AI’s core computational bottleneck, Cerebras’s architectural advantage comes into play.
From a business perspective, Cerebras’s revenue grew from $24.6 million in 2022 to $510 million in 2025 [10][13] (a 20x increase over 3 years); the company raised $5.55 billion in its IPO and reached a first-day valuation of $48.8 billion, a historic capital market validation of the WSI technology route. However, the Non-GAAP operating loss widened from $21.8 million to $75.7 million (+247%), cloud gross margin plunged from 68% to 16%, 86% of revenue depends on two related-party UAE customers, and OpenAI received 33.4 million warrants at $0.01/share, terms that are deeply in-the-money. These figures demand calm amid the euphoria.
Cerebras’s technology route will not replace GPUs, but it has established a clear competitive moat in a specific and increasingly important market segment: large model inference. Key observation points for the next 2-3 years:
- The actual delivery cadence and gross margin for the >$20 Billion OpenAI contract.
- CS-3 aggregate throughput / TCO data under high-concurrency scenarios (currently missing).
- The relative race between SRAM density growth and HBM capacity expansion.
- Whether the implementation of Co-Packaged Optics (CPO) technology can shrink the I/O bottleneck by 1-2 orders of magnitude from the current 133,000x gap.
- The change in customer concentration after CFIUS clearance allows more U.S.-based enterprise customers to be onboarded.
14. References
[1] Cerebras Official Chip Page. https://www.cerebras.ai/chip
[2] Cerebras WSE-3 Press Release (March 2024). https://www.cerebras.ai/press-release/cerebras-announces-third-generation-wafer-scale-engine
[3] Wikipedia - Cerebras Systems. https://en.wikipedia.org/wiki/Cerebras_Systems
[4] IEEE Spectrum - Cerebras WSE-3: Third Generation Superchip for AI (March 2024). https://spectrum.ieee.org/cerebras-chip-cs3
[5] arXiv - A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems (March 2025). https://arxiv.org/html/2503.11698v1
[6] Peak FLOPS Substack - Breaking down the Cerebras Wafer Scale Engine (April 2026). https://wafer.substack.com/p/breaking-down-the-cerebras-wafer
[7] Introl Blog - Cerebras Wafer-Scale Engine: When to Choose Alternative AI Architecture (April 2026). https://introl.com/blog/cerebras-wafer-scale-engine-cs3-alternative-ai-architecture-guide-2025
[8] TechCrunch - The five technical challenges Cerebras overcame (August 2019). https://techcrunch.com/2019/08/19/the-five-technical-challenges-cerebras-overcame-in-building-the-first-trillion-transistor-chip/
[9] TechCrunch - 600 billion dollar AI chip darling Cerebras almost died early on, burning 8 million dollars a month (May 2026). https://techcrunch.com/2026/05/16/
[10] Mostly Metrics - Cerebras IPO S-1 Breakdown (April 2026). https://www.mostlymetrics.com/p/cerebras-ipo-s1-breakdown
[11] Cerebras Blog - 100x Defect Tolerance: How Cerebras Solved the Yield Problem. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
[12] Cerebras Blog - Cerebras CS-3 vs. Nvidia DGX B200 Blackwell (September 2025). https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-dgx-b200-blackwell
[13] SEC.gov - Cerebras S-1 Registration Statement (April/May 2026). https://www.sec.gov/Archives/edgar/data/2021728/000162828026025762/cerebras-sx1april2026.htm
[14] Forbes - Cerebras, Groq And SambaNova Line Up To Compete With Nvidia (October 2025). https://www.forbes.com/sites/karlfreund/2025/10/21/cerebras-groq-and-sambanova-line-up-to-compete-with-nvidia/
[15] Reuters - Cerebras shares skyrocket in debut (May 2026). https://www.reuters.com/legal/transactional/cerebras-set-debut-stock-market-gripped-by-ai-mania-2026-05-14/
[16] Sacra Research - Cerebras vs Nvidia. https://sacra.com/research/cerebras-vs-nvidia/
[17] TechCrunch - Cerebras raises 5.5 billion dollars, then stock pops 108% (May 2026). https://techcrunch.com/2026/05/14/cerebras-raises-5-5b-kicking-off-2026s-ipo-season-with-a-bang/
[18] Chip Yield Analysis Tool - Cerebras WSE-3 Wafer-Scale Yield Analysis. https://blackyabhishek.github.io/analysis/cerebras_yield_analysis.html
[19] Cerebras Blog - Supporting PyTorch on the Cerebras Wafer-Scale Engine (April 2022). https://www.cerebras.ai/blog/supporting-pytorch-on-the-cerebras-wafer-scale-engine
[20] Cell/Device Journal - Performance, efficiency, and cost analysis of wafer-scale AI (2025). https://www.cell.com/device/fulltext/S2666-9986(25)00147-4
[21] Hot Chips 2024 - Cerebras Wafer-Scale AI Presentation. https://hc2024.hotchips.org/assets/program/conference/day2/72_HC2024.Cerebras.Sean.v03.final.pdf
[22] Cerebras Blog - How Cerebras Solved the Yield Problem. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
[23] Cerebras and AWS Collaboration Press Release (March 2026). https://www.cerebras.ai/press-release/awscollaboration
