1. Overview

Wafer-Scale Integration (WSI) is not an original concept of Cerebras. In 1980, Gene Amdahl, the father of the IBM mainframe, founded Trilogy Systems, attempting to manufacture an entire wafer as a single processor. Trilogy raised $230 million from entities including IBM and Sperry Rand — the largest startup financing in Silicon Valley history at the time — but during prototype testing, the entire wafer short-circuited upon power-up and burned to a dim red glow, metal wiring layers delaminated, and the thermal solution failed completely. Combined with a devastating fab flood and the sudden death of the company president, along with Amdahl himself being seriously injured in a car accident, Trilogy ended in total failure five years after its founding. In the same period, Texas Instruments, ITT, and the U.S. National Security Agency (NSA) all attempted the WSI route, but the shared conclusion was: manufacturing a commercial wafer-scale chip would require 99.99% fabrication yield — something considered impossible to achieve for at least 100 years at the time.

Cerebras Systems was founded in 2015 by the core SeaMicro team (Andrew Feldman, Gary Lauterbach, Michael James, Sean Lie, Jean-Philippe Fricker). SeaMicro, founded in 2007, was known for its high-density, low-power micro-server designs and was acquired by AMD in 2012 for $334 million. This team had a deep understanding of how to solve system-level bottlenecks with unconventional architectures, a DNA that carried directly into Cerebras’s technical approach to WSI.

Cerebras provided engineering solutions in the following five dimensions that all previous WSI attempts had failed to deliver [6][8]: defect tolerance and yield control, wafer-scale cross-die interconnect (reticle stitching), mechanical compensation for thermal expansion coefficients, a vertical power delivery architecture, and high-flow-rate direct liquid cooling. This transformed WSI, which had been suspended in theoretical discussion since the 1980s, into a mass-producible commercial reality for the first time.

As of 2026, Cerebras has introduced three generations of Wafer-Scale Engine (WSE-1/2/3), building a complete product line from the single-chip CS-3 system to 2,048-node clusters. On May 14, 2026, it completed its IPO on Nasdaq (ticker: CBRS), raising $5.55 billion at $185 per share, with a first-day opening price of $350 and a fully diluted valuation of approximately $48.8 billion [15][17], making it the largest U.S. technology IPO since 2019.


2. Product Evolution: From WSE-1 to WSE-3

2.1 Core Parameter Comparison Across Three Generations of Wafer-Scale Engines

Since launching its first WSE in 2019, Cerebras has driven generational technology leaps on roughly a two-year cycle. The process node evolved from TSMC 16nm to 5nm, transistor count grew from 1.2 trillion to 4 trillion, and core compute power surged from 47 PFLOPS to 125 PFLOPS. Below is a comprehensive physical parameter comparison of the three WSE generations against the NVIDIA H100:

Specification | WSE-1 (2019) | WSE-2 (2021) | WSE-3 (2024) | NVIDIA H100 (Reference)
Process Node | TSMC 16nm | TSMC 7nm | TSMC 5nm | TSMC 4N
Wafer/Die Area | 46,225 mm² | 46,225 mm² | 46,225 mm² | 814 mm²
Transistor Count | 1.2 Trillion | 2.6 Trillion | 4.0 Trillion | 80 Billion
AI-Optimized Cores | 400,000 | 850,000 | 900,000 | 16,896 (CUDA cores)
On-Chip Memory (SRAM) | 18 GB | 40 GB | 44 GB | 0.05 GB (L2 Cache)
On-Chip Memory Bandwidth | 9 PB/s | 20 PB/s | 21 PB/s | ~0.003 PB/s (HBM3)
On-Chip Interconnect Bandwidth | 100 Pb/s | 220 Pb/s | 214 Pb/s | 0.0576 Pb/s (NVLink)
FP16 Peak Compute | 47 PFLOPS | 75 PFLOPS | 125 PFLOPS | ~2 PFLOPS
System Product | CS-1 | CS-2 | CS-3 | DGX H100

2.2 Key Changes in Generational Evolution

WSE-1 (2019): The first-generation commercial wafer-scale chip, integrating 400,000 cores and 1.2 trillion transistors on a 16nm process. 18 GB of on-chip SRAM provided 9 PB/s of bandwidth and 47 PFLOPS of compute. The CS-1 system, a 19-inch rack-mountable device, proved the commercial viability of the wafer-scale approach. Initial customers included life sciences institutions like GlaxoSmithKline and AstraZeneca, as well as U.S. national laboratories.

WSE-2 (2021): The process node jumped to 7nm, transistor count more than doubled to 2.6 trillion, and core count increased to 850,000. 40 GB of SRAM with 20 PB/s bandwidth pushed compute to 75 PFLOPS. The WSE-2 entered the Computer History Museum’s collection, named “The Biggest Chip In the World.” The CS-2 system supported training models exceeding 120 trillion parameters for the first time and underpinned the Andromeda (16 interconnected units, 1 ExaFLOP) and Condor Galaxy series supercomputers.

WSE-3 (2024): 5nm process, 4 trillion transistors, 900,000 cores, 44 GB SRAM, 21 PB/s bandwidth, 125 PFLOPS. SRAM growth approached saturation: capacity rose only 10% from WSE-2 to WSE-3, while transistor count grew 54%. The CS-3, housed in a 15U chassis and drawing 23 kW, doubled performance at the same power envelope as the CS-2 [2][4]. It was named one of Time magazine’s Best Inventions of 2024.

2.3 CS-3 System Specifications

Specification Item | Parameter
Processor | WSE-3 (5nm, 4 Trillion Transistors, 900,000 Cores)
Peak Compute | 125 PFLOPS (FP16)
On-Chip Memory | 44 GB SRAM (21 PB/s Bandwidth)
External Memory Expansion | MemoryX (1.5 TB ~ 1.2 PB)
Cluster Scalability Limit | 2,048 Nodes (256 ExaFLOPs)
Cooling | Proprietary Water Cooling (100 L/min, 20 C)
Power Consumption | ~23 kW
Form Factor | 15U Rack
Model Capacity | Up to 24 Trillion Parameters

2.4 Condor Galaxy Supercomputing Network

Cerebras collaborated with Abu Dhabi’s G42 group to deploy the Condor Galaxy (CG) series supercomputers:

System | Announcement Date | Peak Compute | Total Wafer Cores | Location
CG-1 | July 2023 | 4 ExaFLOPs | 54 Million | USA
CG-2 | November 2023 | 4 ExaFLOPs | 54 Million | USA
CG-3 | March 2024 (Groundbreaking) | 8 ExaFLOPs | 58 Million | Dallas
Full Network Aggregate | - | 16 ExaFLOPs | 166 Million | Cross-Region

2.5 Scientific Computing Benchmark: Molecular Dynamics Simulation

In collaboration with Sandia National Laboratories, Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory (LANL), and the U.S. National Nuclear Security Administration (NNSA), researchers successfully simulated high-precision interactions between 800,000 atoms on the WSE-2 [3][20]. The simulation computed with a time step of 1 femtosecond (10^-15 seconds), and a single step on the WSE-2 took only microseconds. Its speed significantly surpassed Frontier, then the world’s top supercomputer built on traditional nodes, demonstrating the wafer-scale architecture’s innate hardware suitability for the extreme local real-time feedback demands of simulating strongly coupled physical systems.


3. Breaking the Reticle Limit: Scribe Line Stitching and Wafer-Scale Lithography

3.1 Physical Constraint: The Reticle Limit

The core limitation in semiconductor lithography comes from the Field of View of the optical lens. For current mainstream Deep Ultraviolet (DUV) and Extreme Ultraviolet (EUV) lithography machines, the maximum pattern area printable in a single exposure is limited by the physical size of the reticle (photomask) — typically 26 mm x 33 mm, approximately 858 mm². This means that, regardless of design, the physical area of any single-exposure die cannot exceed this limit. Traditional chip manufacturers use a Step-and-Repeat process to expose the same pattern multiple times across the wafer, subsequently mechanically cutting along scribe lines to divide the wafer into dozens or hundreds of individual chips.

3.2 Reticle Stitching Process Details

TSMC executes a standard step-and-repeat lithography process on a 300 mm wafer, printing a total of 84 identical dies arranged in a 12x7 grid. Each die stays within the approximately 858 mm² reticle limit; dividing the 46,225 mm² wafer-scale area by 84 gives roughly 550 mm² per die. Unlike traditional processes, after standard exposure is complete, Cerebras adds extra lithography steps to fabricate miniature metal wires spanning the scribe line regions in the upper metal layers. These wires, less than 1 mm in length and running in the mid-to-upper levels of the on-chip metallization stack, physically connect the on-chip interconnect network (2D Mesh Fabric) of all 84 dies into a single, continuous plane.

This cross-die interconnect system comprises over 1 million wires. The protocol stack layer includes built-in redundancy mechanisms for defective wires (spare wires + automatic rerouting). From the compiler and software perspective, the boundaries of these 84 dies do not exist — the entire wafer presents as a unified, continuous 2D Mesh compute plane.

Technical Cost: Reticle stitching increases the number of photomasks and production steps, making the wafer manufacturing cost higher than that of a standard GPU wafer. However, Cerebras’s argument is that this additional cost is offset by the elimination of off-chip packaging, inter-chip connectivity, and system integration costs enabled by wafer-scale integration.

3.3 Core Micro-Design: The Physical Significance of a 0.05 mm² Core

A single AI-optimized core in the WSE-3 occupies an area of 0.05 mm² — approximately 1/120th the area of a Streaming Multiprocessor (SM, ~6 mm²) in an NVIDIA H100. This extremely small core size has multiple physical implications:

  • Minimized Defect Cost: At a given defect density, a single defect renders only 0.05 mm² of silicon area non-functional, instead of 6 mm² — a 120x reduction in the economic cost of defects.
  • Fine-Grained Redundancy: Within a fixed silicon area, a far greater number of physical cores than functionally required can be integrated, providing ample redundant spares.
  • Short Interconnect Latency: The physical distance between cores is on the order of tens of micrometers, with signal propagation delay being just 1 clock cycle.

Within this tiny 0.05 mm² space, the silicon area allocation is roughly: about 50% for a 48 KB single-cycle SRAM, and the remaining 50% for general-purpose tensor and sparse algebra computation logic composed of approximately 110,000 standard gates. A single core’s peak power consumption at a frequency of 1.1 GHz is only 30 mW.


4. Defect Tolerance and Yield

4.1 The Physical Reality of Defect Density

The typical defect density for TSMC’s 5nm process is approximately 0.001 defects per mm² (data for mature nodes). On the 46,225 mm² WSE-3, this density translates to roughly 46 random physical defects per wafer. In traditional chip manufacturing, any single one of these defects falling within the active area of a chip renders the entire die non-functional — this is the fundamental reason why, for 75 years, chips have been made smaller, not larger.
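The arithmetic behind this paragraph can be checked with the standard Poisson yield model, Y = exp(-A x D). This is a common textbook approximation, not a model published by Cerebras or TSMC for this process. The sketch below uses only the defect density and areas quoted in this document; the function names are illustrative.

```python
import math

DEFECT_DENSITY_PER_MM2 = 0.001  # figure quoted in the text above


def expected_defects(area_mm2: float, density: float = DEFECT_DENSITY_PER_MM2) -> float:
    """Mean number of random defects landing on a region of the given area."""
    return area_mm2 * density


def poisson_yield(area_mm2: float, density: float = DEFECT_DENSITY_PER_MM2) -> float:
    """Probability that a region contains zero defects under a Poisson model."""
    return math.exp(-expected_defects(area_mm2, density))


if __name__ == "__main__":
    regions = [("WSE-3 wafer", 46_225.0), ("H100 die", 814.0), ("WSE-3 core", 0.05)]
    for name, area in regions:
        print(f"{name}: {expected_defects(area):.4f} expected defects, "
              f"zero-defect probability {poisson_yield(area):.3e}")
```

Under this model a defect-free 46,225 mm² region is essentially impossible, while a single 0.05 mm² core is almost always clean, which is why the design has to tolerate defects rather than avoid them.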

4.2 Cerebras’s Solution: A Three-Layer Mechanism for 100x Defect Tolerance

Layer 1: Core Miniaturization. A single core area of 0.05 mm² versus the ~6 mm² of an H100 SM creates an asymmetry in defect cost. At the identical defect density of 0.001 defects/mm²: one defect on WSE-3 has a 50% probability of landing inside a core area, with an expected loss of 0.025 mm² of silicon area; one defect on H100 has a 99.8% probability of landing inside an SM area, with an expected loss of approximately 3 mm² of silicon area. Extrapolating from this, the expected silicon lost per defect is roughly 120 times smaller on the WSE-3 (3 mm² / 0.025 mm²), matching the core-to-SM area ratio.

Layer 2: Physical Redundancy. WSE-3 physically integrates 970,000 cores on the wafer but nominally enables only 900,000. The 70,000 extra cores (approximately 7.2% physical redundancy) provide ample spare capacity.

Layer 3: Fail-in-Place Resilient Routing. During the chip’s power-on initialization phase, test logic identifies the locations of all defective cores. The on-chip reconfigurable interconnect network then automatically bypasses the failed cores, remapping neighboring healthy cores to the corresponding positions in the logical grid. This process is completed entirely automatically in hardware and is transparent to the software layer.

The net effect of this three-layer mechanism: The effective active silicon area ratio of the WSE-3 reaches approximately 93% (900,000 / 970,000), achieving a usable yield at commercial scale comparable to diced chip processes. Cerebras’s core insight is that solving the yield problem does not depend on reducing defects, but on making the economic cost of each defect approach zero.
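The redundancy figures and the bypass idea can be illustrated with a toy model. This is a minimal sketch under stated assumptions: the row-wise remapping rule in `remap_row` is hypothetical and far simpler than whatever the actual hardware does; only the 970,000 and 900,000 core counts come from the text.

```python
PHYSICAL_CORES = 970_000  # cores fabricated on the wafer (from the text)
LOGICAL_CORES = 900_000   # cores nominally enabled (from the text)


def spare_fraction(physical: int = PHYSICAL_CORES, logical: int = LOGICAL_CORES) -> float:
    """Fraction of physical cores held back as spares."""
    return (physical - logical) / physical


def remap_row(physical_ok: list[bool], logical_width: int) -> list[int]:
    """Hypothetical remap: logical column i uses the i-th healthy physical core.

    Raises ValueError when the row has fewer healthy cores than required.
    """
    healthy = [idx for idx, ok in enumerate(physical_ok) if ok]
    if len(healthy) < logical_width:
        raise ValueError("not enough healthy cores in this row")
    return healthy[:logical_width]


if __name__ == "__main__":
    print(f"spare fraction: {spare_fraction():.1%}")
    # One defective core (index 1) in a row of four, three logical columns needed.
    print(remap_row([True, False, True, True], logical_width=3))
```

Software only ever sees the logical indices, which is the sense in which the remapping is transparent.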


5. Micro-Core Architecture and On-Chip Dataflow Network

5.1 Internal Structure of a Compute Core

Each WSE core internally contains:

  • 48 KB single-cycle SRAM, using an 8-Bank split architecture (6 KB per Bank, 32-bit width), supporting simultaneous conflict-free access of 2x 64-bit reads + 1x 64-bit write per clock cycle.
  • 256 Bytes of software-managed cache, specifically for storing high-frequency changing data structures like accumulators.
  • Compute logic composed of 110,000 standard gates, supporting tensor multiply-accumulate and sparse matrix operations.
  • Native sparse triggering in the instruction set: upon detecting an input weight is zero, it automatically skips the multiply-accumulate operation, yielding a several-fold effective speedup when processing highly sparse large language models.
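The zero-skipping behavior in the last bullet can be sketched in software. This is an illustration of the idea only, assuming a simple dot product; `sparse_dot` is a hypothetical helper and does not represent the actual instruction set.

```python
def sparse_dot(weights: list[float], activations: list[float]) -> tuple[float, int]:
    """Dot product that skips zero weights.

    Returns the result and the number of multiply-accumulates actually performed.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    acc = 0.0
    macs = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # zero weight: no multiply-accumulate is issued
        acc += w * a
        macs += 1
    return acc, macs


if __name__ == "__main__":
    result, macs = sparse_dot([0.0, 2.0, 0.0, 0.5], [1.0, 3.0, 5.0, 4.0])
    print(f"result={result}, MACs performed={macs} of 4")
```

With half the weights equal to zero, half the multiply-accumulates disappear, so the achievable speedup tracks the sparsity of the weights.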

5.2 On-Chip Interconnect Network Architecture

The WSE builds a high-speed interconnect network based on a 2D Mesh topology. Each core integrates a 5-port structural router (East, West, South, North, Local), supporting bidirectional 32-bit single-cycle data transfer. Each physical transmission packet consists of 16-bit compute data + 16-bit index data, perfectly matching the coordinate addressing requirements of sparse matrix computation.
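The 16-bit data plus 16-bit index packet can be pictured as a single 32-bit word. The sketch below assumes the index occupies the upper half and the data the lower half; that bit layout and the function names are assumptions for illustration, since the text only gives the field widths.

```python
def pack_wavelet(index: int, data: int) -> int:
    """Pack a 16-bit index and 16-bit data payload into one 32-bit word.

    Assumed layout: index in bits 31..16, data in bits 15..0.
    """
    if not 0 <= index <= 0xFFFF or not 0 <= data <= 0xFFFF:
        raise ValueError("index and data must each fit in 16 bits")
    return (index << 16) | data


def unpack_wavelet(word: int) -> tuple[int, int]:
    """Split a 32-bit word back into (index, data)."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


if __name__ == "__main__":
    word = pack_wavelet(index=3, data=0x3C00)  # 0x3C00 is 1.0 in IEEE FP16
    print(hex(word), unpack_wavelet(word))
```

Carrying the index alongside the value is what lets a receiving core place a sparse element at the right coordinate without a separate lookup.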

Network communication is divided at the hardware level into 24 independently configurable static routing colors (Colors). Each color has hardware-isolated dedicated buffer queues, sharing the physical bus via Time-Multiplexing for non-blocking transmission. The on-chip Fabric natively supports hardware-level single-cycle Broadcast and Multicast. Since physical wires between cores are only tens of micrometers, cross-core signal latency is just 1 clock cycle (~0.9 ns at 1.1 GHz).

Fundamental Architectural Difference from GPU: The WSE uses a Dataflow Architecture, in which computation is driven by the arrival of data. 32-bit wavelet messages travel through the 2D grid; the wavelet’s 5-bit color tag determines the routing path and the task to trigger. When a wavelet arrives on a specific color channel, the bound task is launched for execution. If the weight is zero, no wavelet is emitted, achieving unstructured sparsity acceleration. In contrast, GPUs use a control flow architecture (SIMT/Warp), in which execution is driven by the program counter; all 32 threads in the same warp execute the same instruction and cannot skip arbitrary zero-value computations on a per-operand basis.
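The data-triggered execution model can be sketched as a small event loop. This is a toy model under stated assumptions: the `Core` class, `send_products`, and `accumulate` are hypothetical names, and real routing, buffering, and timing are ignored. Only the count of 24 colors and the rule that zero weights emit no wavelet come from the text.

```python
from collections import deque
from typing import Callable

NUM_COLORS = 24  # number of routing colors quoted in the text


class Core:
    """Toy core: a task runs only when a wavelet arrives on the color it is bound to."""

    def __init__(self) -> None:
        self.tasks: dict = {}
        self.queues = {color: deque() for color in range(NUM_COLORS)}
        self.accumulator = 0.0

    def bind(self, color: int, task: Callable[["Core", float], None]) -> None:
        self.tasks[color] = task

    def receive(self, color: int, payload: float) -> None:
        self.queues[color].append(payload)

    def run(self) -> int:
        """Drain queued wavelets on bound colors; return how many tasks fired."""
        fired = 0
        for color, queue in self.queues.items():
            task = self.tasks.get(color)
            while task is not None and queue:
                task(self, queue.popleft())
                fired += 1
        return fired


def send_products(core: Core, color: int, weights: list[float], activation: float) -> int:
    """Emit one wavelet per nonzero weight; zero weights generate no traffic."""
    sent = 0
    for w in weights:
        if w != 0:
            core.receive(color, w * activation)
            sent += 1
    return sent


def accumulate(core: Core, value: float) -> None:
    core.accumulator += value


if __name__ == "__main__":
    core = Core()
    core.bind(3, accumulate)
    sent = send_products(core, 3, [0.0, 2.0, 0.0, 3.0], activation=0.5)
    print(f"wavelets sent={sent}, tasks fired={core.run()}, sum={core.accumulator}")
```

No program counter decides what runs next here; work exists only where data arrived, which is the contrast with the warp-based model described above.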

5.3 On-Chip Memory Hierarchy

Level | Medium | Capacity | Aggregate Bandwidth | Latency | Physical Location
L0 - Register | Core-Private | 256 B | 5.3 TB/s (Single Core Peak) | 1 cycle | Inside Core
L1 - SRAM | Core-Private | 48 KB x 900,000 = 44 GB | 21 PB/s (Full Chip Aggregate) | 1 cycle | Inside Core
L2 - MemoryX | DRAM + Flash | 1.5 TB ~ 1.2 PB | Proprietary Protocol | High | External Cabinet
L3 - SwarmX | Switch Network | Cluster-Level | Broadcast/Reduce Hardware Acceleration | Topology-Dependent | Cluster Interconnect

The 21 PB/s aggregate bandwidth of the 44 GB on-chip SRAM in the WSE-3 fundamentally changes the compute economics. Compared to the 3 TB/s of H100’s HBM3, the difference is a factor of 7,000. More critically, SRAM bandwidth scales linearly with capacity (each bank can be read in parallel by adjacent compute units), whereas HBM bandwidth is limited by the number of physical channels — this is an architectural difference, not one that can be bridged by process advancements alone.


6. Power Delivery, Thermal Management, and Thermal Expansion Mechanical Compensation Engineering

6.1 Power Delivery Challenge: Voltage Consistency at 23 kW

The WSE-3 has a full-load rated power consumption of 23 kW, with an operating voltage of approximately 0.8-0.9 V (sub-volt level), requiring a continuous current injection of roughly 25,600 to 28,750 Amps. In a traditional horizontal two-dimensional power delivery architecture, power is routed from the chip edge through lateral bus bars on the PCB. Due to the physical impedance of metal wiring, a current approaching 30 kA crossing a 215 mm wafer would generate a catastrophic IR Drop: theoretically, the voltage drop from edge to center could reach 9.6 V, while the chip’s operating voltage is under 1 V, making it impossible to power cores in the central region.
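The current requirement follows directly from I = P / V, and the quoted 9.6 V drop at 30 kA implies an effective lateral resistance via R = V / I. A minimal sketch of that arithmetic, using only figures from the paragraph above; the function names are illustrative:

```python
def supply_current_amps(power_w: float, voltage_v: float) -> float:
    """Current needed to deliver the given power at the given voltage (I = P / V)."""
    return power_w / voltage_v


def implied_resistance_ohms(drop_v: float, current_a: float) -> float:
    """Resistance that would produce the given voltage drop at the given current."""
    return drop_v / current_a


if __name__ == "__main__":
    for volts in (0.8, 0.9):
        print(f"23 kW at {volts} V -> {supply_current_amps(23_000, volts):,.0f} A")
    r = implied_resistance_ohms(9.6, 30_000)
    print(f"9.6 V drop at 30 kA implies {r * 1e3:.2f} milliohm of lateral resistance")
```

A path resistance of a fraction of a milliohm is enough to exceed the entire supply voltage at these currents, which is why shortening the path to a few millimeters matters more than thickening the conductors.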

Solution: 3D Vertical Power Delivery. Cerebras placed a custom multi-layer high-density power distribution PCB directly behind the wafer, embedding over 300 high-frequency step-down Voltage Regulator Modules (VRMs). Current is projected perpendicularly to the wafer surface directly onto micro-electrical contacts on the back of each core, over a physical distance of only a few millimeters. Each of the 84 die areas has its voltage regulated independently, completely eliminating lateral IR Drop. The entire power delivery network is encapsulated within a four-layer physical sandwich called the Engine Block: Cold Plate, Wafer, Custom Compliant Connector, Power PCB.

6.2 Coefficient of Thermal Expansion Mismatch and Mechanical Compensation

The core packaging challenge faced by the system originates from the Coefficient of Thermal Expansion (CTE) mismatch of heterogeneous materials:

Material | CTE (ppm/C) | Expansion over a 215 mm Edge under 65 C Rise
Silicon | 2.6 | ~36 µm
FR-4 PCB | 15 (Lateral) | ~210 µm
Copper (Cold Plate) | 17 | ~238 µm

The expansion of the PCB under a 65 C temperature rise is roughly 5.8 times that of silicon. The values above are linear expansions along a full 215 mm edge; measured from the wafer center to a corner (a half-diagonal of about 152 mm), the relative displacement between PCB and silicon is about 122 µm. For traditional packaging (BGA, flip-chip, wire bonding), this already exceeds the failure threshold by a factor of 5-7.
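These displacement figures follow from the linear expansion relation, delta_L = CTE x L x delta_T. The sketch below reproduces them from the CTE values, the 215 mm edge, and the 65 C rise given above. Treating the 122 µm figure as a center-to-corner measurement over the half-diagonal is an interpretation that happens to match the number, not something the source states.

```python
import math

EDGE_MM = 215.0
DELTA_T_C = 65.0
CTE_PPM_PER_C = {"silicon": 2.6, "fr4": 15.0, "copper": 17.0}


def linear_expansion_um(cte_ppm_per_c: float, length_mm: float, delta_t_c: float) -> float:
    """Thermal expansion in micrometers: CTE (ppm/C) x length (mm) x delta T (C)."""
    # ppm (1e-6) times mm-to-um (1e3) leaves a net factor of 1e-3.
    return cte_ppm_per_c * length_mm * delta_t_c * 1e-3


def relative_corner_shift_um(cte_a: float, cte_b: float) -> float:
    """Relative displacement at a corner, measured from the center of the square."""
    half_diagonal_mm = EDGE_MM / math.sqrt(2)
    return linear_expansion_um(abs(cte_a - cte_b), half_diagonal_mm, DELTA_T_C)


if __name__ == "__main__":
    for name, cte in CTE_PPM_PER_C.items():
        print(f"{name}: {linear_expansion_um(cte, EDGE_MM, DELTA_T_C):.1f} um over the full edge")
    shift = relative_corner_shift_um(CTE_PPM_PER_C["fr4"], CTE_PPM_PER_C["silicon"])
    print(f"PCB vs. silicon at the corner: {shift:.1f} um")
```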

Solution: Co-founder Jean-Philippe Fricker led the design of a custom Compliant Elastomeric Connector. This connector, sandwiched between the wafer and the PCB, maintains good electrical conductivity in the vertical direction while possessing high physical compliance and deformation resilience in the horizontal shear direction. When temperature differences cause the PCB to expand more than the silicon, the elastic connector layer absorbs all shear stress through microscopic physical shear deformation, ensuring contact reliability for hundreds of thousands of power and signal pins.

Additionally, a dynamic Ambulating Thermal Interface (ATI) material is embedded between the cold plate and the back of the wafer. Composed of a high thermal conductivity material and a physical friction-reducing material laminated together, it allows the water-cooled copper plate to undergo micron-level lossless horizontal sliding against the silicon surface during thermal deformation, preventing stress transfer that could physically fracture the silicon wafer.

No off-the-shelf automated equipment could precisely handle such a large-area, fragile, heterogeneous 3D layered assembly. Cerebras designed and built dedicated high-precision alignment and pressure assembly machines from scratch to close the mechanical structure loop for wafer-scale components.

6.3 Liquid Cooling System

The system employs Direct-to-Chip Liquid Cooling. Dual-redundant industrial high-pressure pumps inject cooling water at 20 +/- 2 C, at a flow rate of 100 +/- 10 L/min, into a brass manifold cold plate conforming to the wafer surface. The interior of the cold plate is machined with micro-fin channels to maximize heat exchange surface area.

At the data center level, row-based and in-rack high-precision fluid manifold controls are deployed, with digital monitoring of flow and pressure to eliminate stagnant zones. The CSoft software layer runs dynamic duty-cycle dummy operations (Power Ramp Smoothing via Dummy Operations) when no computational workload is active, smoothing power transients to stay within electrical safety boundaries and preventing potential physical damage to the wafer-scale system from instantaneous, drastic voltage fluctuations.


7. CSoft Compilation Pipeline and Execution Modes

7.1 Compiler Hierarchy

The core of the Cerebras CSoft software platform is the Cerebras Graph Compiler (CGC), responsible for losslessly mapping a PyTorch/TensorFlow computational graph onto the physical grid of 900,000 cores. The compilation pipeline follows a stepwise Lowering logic:

PyTorch Model
-> Lazy Tensor Backend (ATen Operator Graph Capture)
-> XLA HLO (High-Level Optimization Custom Calls)
-> CIRH (Cerebras IR High, an MLIR dialect, full-graph level rewrite passes)
-> Operator Deep Fusion, Constant Folding, Common Subexpression Elimination, Dead Code Pruning
-> Pattern Matching against a Pre-built Library of Hand-Tuned High-Performance Kernels
-> [Match Success] Directly Generate Optimized Instructions
-> [Match Failure] CLAIR/LAIR (Low-Level Linear Algebra IR)
-> AutoGen Automated Kernel Compiler
-> Polyhedral Space Transformation Mathematical Optimization
-> 2D Core Topology-Aware Placement
-> 8-Bank SRAM Allocation
-> Final Physical Instruction Machine Code

The AutoGen kernel compiler supports four strategies: default, disabled, medium, and aggressive, providing adaptive kernel generality for customized development.

Key Constraint: The computational graph must be a Static Graph — dynamic shapes or data-dependent branching are not supported. The routing table is fixed once at compile time for all 900,000 cores.

7.2 Two Execution Modes

Layer-Pipelined Mode - Model Fully Resident On-Chip:

flowchart LR
    Input[Input Data Stream] --> L1[WSE Partition 1<br/>Layer 1]
    L1 --> L2[WSE Partition 2<br/>Layer 2]
    L2 --> Ldots[...]
    Ldots --> LN[WSE Partition N<br/>Layer N]
    LN --> Output[Output]

Characteristics: Model parameters are fully resident on-chip. Multiple micro-batches run concurrently, spatially interleaved across the wafer. The compiler must solve a VLSI floorplanning problem — one Cerebras proved to be NP-hard at ISPD 2020. Applicable for models whose parameters fit within the 44 GB SRAM.

Weight Streaming Mode - Current Default:

flowchart LR
    subgraph MemoryX[External MemoryX Storage]
        W1[Layer 1 Weights]
        W2[Layer 2 Weights]
        WN[Layer N Weights]
    end
    W1 --> WS[WSE Full Chip<br/>900,000 Cores<br/>Single Layer Computation]
    W2 --> WS
    WN --> WS
    WS --> Grad[Gradient Return]
    Grad --> MemoryX

Characteristics: The entire wafer processes a single layer at a time. Weights reside in external MemoryX (DRAM + Flash, up to 1.2 PB) and are streamed to the wafer layer-by-layer. All 900,000 cores process the same layer; activations remain on-chip, and weights are discarded after computation. Scaling to 2,048 systems requires only changing a single flag.
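The weight-streaming loop described above can be sketched in a few lines. This is a conceptual illustration in plain Python, not Cerebras code: `stream_weights`, `matvec`, and `forward` are hypothetical names, the "MemoryX" store is just a list, and a ReLU dense layer stands in for a real model layer.

```python
from typing import Iterator

Matrix = list[list[float]]


def stream_weights(memoryx: list[Matrix]) -> Iterator[Matrix]:
    """Yield one layer's weights at a time, standing in for the external store."""
    for layer_weights in memoryx:
        yield layer_weights


def matvec(weights: Matrix, x: list[float]) -> list[float]:
    """Dense matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def forward(memoryx: list[Matrix], x: list[float]) -> list[float]:
    """Run the model layer by layer; only the activations persist between layers."""
    activations = x  # stays resident ("on-chip") for the whole pass
    for weights in stream_weights(memoryx):
        activations = [max(0.0, v) for v in matvec(weights, activations)]
        # The streamed weights are not retained once the layer has been computed.
    return activations


if __name__ == "__main__":
    model = [
        [[1.0, -1.0], [0.5, 0.5]],  # layer 1: 2 inputs -> 2 outputs
        [[2.0, 0.0]],               # layer 2: 2 inputs -> 1 output
    ]
    print(forward(model, [3.0, 1.0]))
```

Because only one layer's weights are present at any moment, the model size is bounded by the external store rather than by on-chip memory.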

7.3 The Order-of-Magnitude Difference in Software Complexity

Training a 175 billion parameter large model on a GPU cluster typically requires ~20,000 lines of distributed training code (a combination of Tensor Parallelism + Pipeline Parallelism + FSDP + DeepSpeed + Megatron-LM). Cerebras claims an equivalent scale requires just 565 lines of PyTorch code, and that the software complexity of training a 1 trillion parameter model is comparable to training a 1 billion parameter model on GPUs — a frequently underestimated competitive advantage.

7.4 Developer Interfaces

  • AI Inference Users: OpenAI-compatible API, zero learning curve.
  • Model Training Users: Standard PyTorch / TensorFlow frameworks; CSoft handles the low-level details.
  • HPC Developers: CSL SDK — a Zig-based DSL allowing programming of individual cores, manual configuration of routing tables, and adaptation of code and data within the 48 KB memory. There are no thread concepts, no shared memory, no kernel launches, but also no need to handle synchronization or race conditions.

8. Wafer-Scale Supercomputer Cluster Architecture

8.1 Cluster Components

Component | Function | Specification
CS-3 | Single System Compute Unit | 15U, 23 kW, 125 PFLOPS
MemoryX | External Weight Storage Node | 1.5 TB ~ 1.2 PB
SwarmX | Cluster Switching Network | Hardware-Level Broadcast + Gradient Reduce/Sum
CSL | Cluster Interconnect Topology | Up to 2,048 Nodes, 256 ExaFLOPs
AI400X2 | Parallel File Storage | 90+ GB/s Sustained Bandwidth, 3M+ IOPS

8.2 DARPA Co-Packaged Optics Project

Under DARPA funding, Cerebras is collaborating with Ranovus to develop wafer-scale Co-Packaged Optics (CPO). The goal is to directly mount Ranovus’s optical fiber transceivers onto the edge of the wafer, replacing traditional electrical off-chip interconnects with a multi-wavelength, multi-mode fiber optic network. This solution can provide over 100x the data throughput capacity of conventional CPO solutions while drastically reducing the power consumption of cluster-level parameter transfer.


9. Wafer-Scale Architecture vs. NVIDIA GPU Clusters

9.1 System-Level Physical Specification Comparison

Evaluation Dimension | Cerebras CS-3 | NVIDIA B200 (Single Card) | NVIDIA DGX B200 (8-Card Node) | NVIDIA GB200 NVL72 (Full Rack)
Core Physical Form | Single Wafer Integrated (15U Chassis) | Dual-Die Bridged (SXM Module) | 8-Card Parallel (10U Chassis) | 72-Card High-Density Parallel (Full Rack)
FP16 Peak Compute | 125 PFLOPS | 4.4 PFLOPS | 36 PFLOPS | 360 PFLOPS
On-Board Memory Capacity | 44 GB SRAM (On-Chip) | 192 GB HBM3e | 1.5 TB HBM3e | 13.5 TB HBM3e
Memory Access Bandwidth | 21,000 TB/s (21 PB/s) | 8.0 TB/s | 64 TB/s | 576 TB/s
Inter-Chip Communication Performance | On-Chip Metal Traces, Zero External Loss | SXM Physical Socket | NVSwitch On-Board Routing | 9 Sets High-Speed Copper + Optical Switch
Max Rated Power Consumption | ~23 kW | ~1,200 W | ~14.3 kW | ~120 kW
Rack Space | 15U | Single Slot | 10U | 42U
LLM Training Programming Complexity | Pure Data Parallelism, ~565 Lines of Code | Not Applicable | Tensor/Pipeline/FSDP Combination | Extremely Complex Network Topology Configuration

9.2 Training Deployment: Parameter Scale vs. Physical Requirements

Model Parameter Scale | GPU Cluster Requirement (B200) | Cerebras CS-3 Requirement
100 Billion (100B) | >=12 B200 GPUs + NVSwitch + InfiniBand | 1x CS-3 + 2.4 TB MemoryX
1 Trillion (1T) | Hundreds of B200 GPUs + Fiber Interconnect Cluster | 1x CS-3 + 1.2 PB MemoryX
10 Trillion (10T) | 1,000+ B200 Servers | 1x CS-3 + 1.2 PB MemoryX

9.3 Inference: Roofline Model Analysis

The arithmetic intensity formula for the LLM token generation (decode) phase is: Arithmetic Intensity = FLOPs / Bytes, approximately 1 FLOP/byte. The Ridge Point = Peak FLOPS / Peak Memory Bandwidth.

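Chip | FP16 Peak | Memory Bandwidth | Ridge Point | Decode Arithmetic Intensity | State
H100 | 989 TFLOPS | 3.35 TB/s | 295 FLOP/byte | 1 FLOP/byte | 99.7% Compute Units Idle
B200 | 4.4 PFLOPS | 8 TB/s | 550 FLOP/byte | 1 FLOP/byte | >99.8% Idle
WSE-3 | 12.5 PFLOPS | 21 PB/s | 0.6 FLOP/byte | 1 FLOP/byte | Compute Bound

With the figures in this table, the WSE-3 is the only chip that is compute-bound during the decode phase at a batch size of 1 (Batch-1). This means that a single user request can achieve full hardware utilization, removing the reliance on large batch sizes that GPUs need to amortize the overhead of reading weights from HBM. Note that the WSE-3 row uses 12.5 PFLOPS, not the 125 PFLOPS peak quoted elsewhere in this document; at 125 PFLOPS the ridge point would be about 6 FLOP/byte, so Batch-1 decode would sit just below the ridge, still far closer than the 295-550 FLOP/byte ridge points of the GPUs.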
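The roofline figures can be recomputed from the peak compute and bandwidth values in the table. A minimal sketch, assuming a decode arithmetic intensity of 1 FLOP/byte as stated in this section; the function names are illustrative, and the 125 PFLOPS variant is included only because that headline peak appears elsewhere in this document.

```python
DECODE_INTENSITY = 1.0  # FLOP per byte, as assumed in this section

# name: (peak FLOP/s, memory bandwidth in bytes/s), taken from the table
CHIPS = {
    "H100": (989e12, 3.35e12),
    "B200": (4.4e15, 8e12),
    "WSE-3 (table value)": (12.5e15, 21e15),
    "WSE-3 (125 PFLOPS headline)": (125e15, 21e15),
}


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity at which a chip moves from memory-bound to compute-bound."""
    return peak_flops / bandwidth


def utilization_bound(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Upper bound on the fraction of peak compute usable at the given intensity."""
    return min(1.0, intensity / ridge_point(peak_flops, bandwidth))


if __name__ == "__main__":
    for name, (peak, bw) in CHIPS.items():
        print(f"{name}: ridge {ridge_point(peak, bw):.1f} FLOP/byte, "
              f"utilization bound {utilization_bound(DECODE_INTENSITY, peak, bw):.1%}")
```

The gap between the two WSE-3 rows shows how sensitive the compute-bound conclusion is to which peak figure is used.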

Measured Inference Performance Comparison (Independent Benchmarks):

Model | Cerebras CS-3 | NVIDIA DGX B200 | Advantage Factor
Llama 4 Maverick (400B) | 2,500+ tok/s/user [7][12] | ~1,000 tok/s/user | 2.5x
gpt-oss-120B (10 Concurrency) | 2,700+ tok/s | 580 tok/s | 4.7x
DeepSeek R1 70B | 1,600 tok/s | - | -
Perplexity Sonar | 1,200 tok/s | - | -

9.4 Inference TCO Comparison

Cerebras CS-3 pricing is approximately $2-3 million per node. DGX B200 (8x B200) is around $300,000. However, Cerebras claims that under an equivalent online token generation load, the CS-3’s comprehensive TCO (hardware CapEx + power OpEx) is approximately 32% lower than DGX B200, while delivering 21x faster single-token interaction speed. The core of this pricing logic is that a single CS-3 is equivalent in inference throughput to multiple DGX B200s and eliminates the significant hidden costs of InfiniBand networks, multi-rack space, and cluster management software.


10. I/O Bottleneck

10.1 I/O Bottleneck: A 133,000x Gap

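On-Chip SRAM Bandwidth: 21 PB/s. Off-Chip MemoryX/Network I/O: approximately 150-200 GB/s. Gap: roughly 105,000x to 140,000x across that range; the 133,000x headline figure corresponds to about 158 GB/s of off-chip bandwidth.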

For models that fit within the 44 GB SRAM, this is not an issue — all data circulates on-chip. However, mainstream large language models are rapidly surpassing this capacity (Llama 4 Maverick 400B requires 800 GB of weights in FP16). In Weight Streaming mode, the speed at which weights for each layer are streamed from MemoryX is bottlenecked by the off-chip bandwidth, making I/O the bottleneck.
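The scale of this bottleneck can be estimated from the figures above. The sketch below computes the bandwidth gap and the time to stream one full copy of the weights. It is a lower bound that ignores overlap with compute, compression, and protocol overhead, none of which the source quantifies; the function names are illustrative.

```python
ON_CHIP_BW = 21e15          # bytes/s, SRAM aggregate bandwidth from the text
OFF_CHIP_BW_RANGE = (150e9, 200e9)  # bytes/s, off-chip range from the text


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Model weight footprint; 2 bytes per parameter corresponds to FP16."""
    return num_params * bytes_per_param


def stream_seconds(total_bytes: float, bandwidth: float) -> float:
    """Time to move the given number of bytes at the given bandwidth."""
    return total_bytes / bandwidth


def bandwidth_gap(on_chip: float, off_chip: float) -> float:
    """Ratio of on-chip to off-chip bandwidth."""
    return on_chip / off_chip


if __name__ == "__main__":
    footprint = weight_bytes(400e9)  # Llama 4 Maverick 400B in FP16
    print(f"weights: {footprint / 1e9:.0f} GB")
    for bw in OFF_CHIP_BW_RANGE:
        print(f"at {bw / 1e9:.0f} GB/s: gap {bandwidth_gap(ON_CHIP_BW, bw):,.0f}x, "
              f"one full weight pass takes {stream_seconds(footprint, bw):.1f} s")
```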

The collaboration model with AWS — where Trainium handles Prefill and Cerebras handles Decode — is essentially an implicit acknowledgment of the chip’s insufficient efficiency during the compute-bound prefill stage.

10.2 Stagnation in SRAM Density Growth

The area of a high-density SRAM bitcell has essentially stalled at approximately 0.021 µm² from the 5nm node down to 3nm and even 2nm nodes. SRAM scaling faces physical limits (leakage and stability constraints of the 6T cell). Concurrently, HBM is progressing from HBM3 (~5 Gb/s/pin) towards HBM4 (8+ Gb/s/pin), with stack layers increasing from 12 to 16, and capacity heading towards 1+ TB by 2028. Cerebras’s SRAM capacity advantage will face a structural challenge within 2-3 generations.

10.3 The Undisclosed Throughput Crossover Point

Cerebras’s headline-grabbing tok/s figures are single-user speeds. GPUs boost aggregate throughput by batch processing users — serving multiple concurrent requests from the same weight read: at a batch size of 10-20, a single H100’s aggregate throughput may already match a single CS-3; at a batch size of 128+, a DGX H100 system can generate thousands of aggregate tok/s at significantly lower hardware cost. Cerebras has never published its aggregate throughput figures under high-concurrency scenarios. This is the most significant missing data point in publicly available materials.


11. Commercial Closed Loop and Capital Landscape

11.1 Revenue Growth Trajectory

Year | Revenue | YoY Growth | Key Drivers
2022 | $24.6 Million | - | Early deployments for life sciences customers
2023 | $78.7 Million | 220% | CS-2 shipments began
2024 | $290.3 Million | 269% | CG-1/2 delivery + G42 prepayment momentum
2025 | $510 Million | 76% | Large MBZUAI order + Inference cloud launch

Quarterly trend acceleration is notable: Q1 2025 was $99.5 Million, and Q4 2025 reached $171.4 Million (an annualized run rate of ~$686 Million).
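The growth rates and run rate quoted in this subsection can be reproduced from the revenue figures. A minimal sketch using only the numbers above; the function names are illustrative.

```python
REVENUE_MUSD = {2022: 24.6, 2023: 78.7, 2024: 290.3, 2025: 510.0}


def yoy_growth_pct(previous: float, current: float) -> float:
    """Year-over-year growth as a percentage."""
    return (current / previous - 1.0) * 100.0


def annualized_run_rate(quarterly_revenue: float) -> float:
    """Quarterly revenue extrapolated to a full year."""
    return quarterly_revenue * 4


if __name__ == "__main__":
    years = sorted(REVENUE_MUSD)
    for prev, cur in zip(years, years[1:]):
        growth = yoy_growth_pct(REVENUE_MUSD[prev], REVENUE_MUSD[cur])
        print(f"{cur}: {growth:.0f}% YoY")
    print(f"Q4 2025 run rate: ${annualized_run_rate(171.4):.1f} Million")
```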

11.2 Deconstructing GAAP Profit: Paper Gains vs. Substantive Losses

Metric | 2024 | 2025 | Change
GAAP Net Income (Loss) | ($481.6 Million) | $237.8 Million | Turned Profitable
Includes: One-Time Non-Cash Gain from Extinguishment of Forward Contract Liabilities | - | $363.3 Million | Paper Adjustment
Stock-Based Compensation (SBC) | - | $49.8 Million | Real Expense (Non-Cash)
Non-GAAP Operating Net Loss | ($21.8 Million) | ($75.7 Million) | Loss Widened by 247%
Operating Cash Flow | $452 Million | ($10 Million) | Turned from Positive to Negative

Key Insight: The $237.8 million GAAP net profit in 2025 is a paper profit, driven by a one-time non-cash gain resulting from the recapitalization of forward purchase contract liabilities with G42. Excluding this, the Non-GAAP operating loss widened from $21.8 million to $75.7 million — a 247% increase in losses. The reason operating cash flow turned from positive to negative is that 2024 included $640.3 million in customer prepayments from G42 (recorded positively in operating cash flow), whereas in 2025, positive prepayments decreased and capacity previously sold was being delivered against.

11.3 Gross Margin Structure

Business Line | 2024 | 2025 | Q1 2025 | Q2 2025 | Q3 2025 | Q4 2025
Consolidated Gross Margin | 42% | 39% | - | - | - | -
Hardware Gross Margin | - | 43% | - | - | - | -
Cloud Services Gross Margin | - | 30% | 68% | 26% | 16% | 21%

Cloud gross margin plummeted from 68% in Q1 to 16% in Q3, reflecting severe underutilization of capacity in newly built data centers. Cerebras is shifting from a high-margin hardware sales model to a lower-margin cloud services model (the business model of the OpenAI contract). The degree of gross margin convergence in this structural shift will determine the company’s long-term earnings power.

11.4 IPO Pricing History

Date | Event | Price Range / Pricing | Notes
2025.10 | Initial Confidential S-1 Filing | Terminated | Stalled due to CFIUS review
2026.05.04 | Amended S-1 Public Filing | $115-$125 | 28 Million Shares
2026.05.10 | First Upward Revision | $125-$135 | Oversubscription exceeded expectations
2026.05.11 | Second Upward Revision | $150-$160 | 30 Million Shares
2026.05.13 | Final Pricing | $185 | 20x Oversubscribed
2026.05.14 | First Day Open | $350 | +89% vs. Pricing
2026.05.14 | First Day Close | $311 | Fully Diluted Valuation ~$48.8 Billion

The pre-IPO secondary market (Hiive) average trading price of $187.53 was within about 1.4% of the IPO price of $185, which indirectly confirms the market’s high expectations for Cerebras’s differentiated AI processor concept.

11.5 Funding History

| Date | Round | Amount | Per Share | Implied Valuation |
|---|---|---|---|---|
| 2016.5 | Series A | $27 Million | - | - |
| 2018.11 | Series D | $88 Million | - | Unicorn Status |
| 2019.11 | Series E | $270 Million | - | $2.4 Billion |
| 2021.11 | Series F | $250 Million | - | Over $4.0 Billion |
| 2024.7-9 | Series F-1 | $85 Million | $14.66 | - |
| 2025.9 | Series G | $1.1 Billion | $36.23 | $8.1 Billion |
| 2026.1 | Series H | $1.0 Billion | $89.01 | - |
| 2026.5.14 | IPO | $5.55 Billion | $185 | Opened ~$48.8 Billion |

The jump from Series H ($89.01) to the IPO price ($185) in just 4 months is a 108% increase. The move from Series F-1 ($14.66) to the IPO price ($185) is a 1,162% increase over roughly 22 months.
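These percentages follow directly from the per-share prices in the funding and IPO tables. The short sketch below recomputes them with an illustrative helper, `pct_increase`, which is not from the source.

```python
def pct_increase(old_price: float, new_price: float) -> float:
    """Percentage increase from old_price to new_price."""
    return (new_price - old_price) / old_price * 100


# Per-share prices quoted in the tables above (USD).
SERIES_F1 = 14.66
SERIES_H = 89.01
IPO_PRICE = 185.0
FIRST_DAY_OPEN = 350.0

h_to_ipo = pct_increase(SERIES_H, IPO_PRICE)
f1_to_ipo = pct_increase(SERIES_F1, IPO_PRICE)
open_vs_pricing = pct_increase(IPO_PRICE, FIRST_DAY_OPEN)

print(f"Series H   -> IPO: {h_to_ipo:.0f}%")
print(f"Series F-1 -> IPO: {f1_to_ipo:,.0f}%")
print(f"IPO price  -> first-day open: {open_vs_pricing:.0f}%")
```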

11.6 Key Terms of the OpenAI Agreement

  • Total Contract Value: Over $20 Billion (publicly stated as a $10 Billion base, expandable to 2 GW).
  • Compute Scale: 750 MW (Base), up to 2,000 MW (Optional).
  • Model: Pure cloud capacity subscription (not hardware sales).
  • Operating Loan: OpenAI provides Cerebras with a $1 Billion working capital loan.
  • Equity: OpenAI receives warrants to subscribe for Class N non-voting common stock, allowing it to hold up to 10% equity in Cerebras upon completion of certain compute deployment milestones.
  • Exclusivity Clause: The contract restricts Cerebras from selling products to Anthropic.

11.7 AWS Bedrock Collaboration

CS-3 systems are deeply integrated into the AWS Bedrock managed inference service as an underlying compute engine, with Trainium handling Prefill and Cerebras handling Decode. This partnership opens a compliant commercial pathway for Cerebras to reach hundreds of thousands of small and medium business customers and enterprise developers. It also gives Cerebras a U.S.-based alternative to its largest geographic customer concentration risk (reliance on a single UAE client).


12. Customer Concentration, Geopolitical Risk, and Competitive Landscape

12.1 Customer Concentration Data

| Customer | 2024 Revenue Share | 2025 Revenue Share | 2025 Accounts Receivable Share | Nature |
|---|---|---|---|---|
| G42 | 85% | 24% | - | Abu Dhabi Tech Holding, Strategic Investor |
| MBZUAI | Not Material | 62% | 77.9% | UAE AI University, G42 Affiliate |
| Total | 85% | 86% | - | Two Customers Represent 86% of Total Revenue |

12.2 The Multifaceted Nature of the G42 Relationship

G42 is simultaneously Cerebras’s customer (made $640 Million in prepayments), supplier (compute collaboration), partner (Condor Galaxy series supercomputer co-operator), investor (participated in Series G/H + holds 3.5 million shares), and a related party (as defined by ASC 850). In 2024, G42 was granted a warrant to purchase 1,857,516 shares of Class N common stock at an exercise price of $0.01 per share.

12.3 Controversial Terms of the OpenAI Warrants

OpenAI received 33.4 million warrants at a nominal exercise price of $0.01 per share (virtually free), in extreme contrast to the IPO price of $185 and the Series H price of $89.01. The value transferred through these warrants ranges from roughly $3 Billion at the Series H price to roughly $6 Billion at the IPO price, a highly unusual arrangement by historical standards.
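The $3 Billion to $6 Billion range can be approximated as the intrinsic value of the warrants at different share prices. This is a rough sketch that assumes all 33.4 million warrants vest and are exercised. It ignores the deployment milestones, time value, and dilution effects, so it is an upper-bound illustration rather than a valuation.

```python
WARRANT_COUNT = 33_400_000
EXERCISE_PRICE = 0.01  # USD per share, as stated in the text


def intrinsic_value(share_price: float,
                    warrants: int = WARRANT_COUNT,
                    exercise_price: float = EXERCISE_PRICE) -> float:
    """Intrinsic value in USD: (share price - exercise price) per warrant, floored at zero."""
    return max(share_price - exercise_price, 0.0) * warrants


# Reference prices quoted in this document (USD per share).
value_at_series_h = intrinsic_value(89.01)
value_at_ipo = intrinsic_value(185.0)

print(f"At Series H price: ${value_at_series_h / 1e9:.2f}B")
print(f"At IPO price:      ${value_at_ipo / 1e9:.2f}B")
```

Because the exercise price is only $0.01, the result is almost exactly the share price multiplied by the warrant count, which is why the transfer scales one-for-one with the market price.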

12.4 Competitive Landscape: Cerebras vs. Groq vs. SambaNova

| Dimension | Cerebras | Groq | SambaNova |
|---|---|---|---|
| Core Idea | Wafer-Scale Integration (Giant Chip) | LPU (Language Processing Unit) | Reconfigurable Dataflow Architecture |
| Chip Area | 46,225 mm² | Standard Single Chip | Standard Single Chip |
| Process Node | TSMC 5nm | Proprietary LPU (14nm) | TSMC 5nm (SN50) |
| Memory Strategy | On-Chip SRAM 44 GB | SRAM (Deterministic Latency) | Hierarchical Reconfigurable |
| Core Selling Point | Memory Bandwidth / Dual-Use Training & Inference | Extremely Low & Deterministic Latency | Flexible Dataflow Programming |
| Funding/Valuation | IPO ~$48.8 Billion | ~$1 Billion+ | $4 Billion |

12.5 NVIDIA’s Structural Advantages

Although Cerebras holds a clear upper hand in inference latency and memory bandwidth, NVIDIA’s advantages in the following dimensions are difficult to shake in the short term:

  • CUDA Ecosystem: 4 million+ developers, the broadest framework support, the most mature model optimization libraries.
  • Workload Flexibility: Same hardware used for both training and inference, supporting any model architecture.
  • Supply Chain Maturity: Global OEM system integrators, a spare parts market, and well-established enterprise operation and maintenance processes.
  • Continuous Memory Capacity Growth: HBM3e to HBM4 (2026) to 1+ TB in 2028, while SRAM density growth is stagnating.
  • Community Effects: All new models debut on CUDA/cuDNN; Cerebras requires labor-intensive manual adaptation per model.

13. Summary and Outlook

Cerebras’s wafer-scale chip represents a unique achievement in the history of semiconductor engineering since the invention of the microprocessor. Gene Amdahl challenged WSI in 1980 with Trilogy Systems and ended in devastating failure. Over forty years later, the same problem has been given engineering-grade solutions across five critical dimensions — defect tolerance, reticle stitching, thermal expansion compensation, vertical power delivery, direct liquid cooling — making WSI a commercial reality for the first time.

From a technical perspective, the WSE-3’s 21 PB/s on-chip SRAM bandwidth lowers the roofline Ridge Point to 0.6 FLOP/byte, low enough that the LLM decode phase becomes compute-bound even at a batch size of 1, which is an impossibility on GPUs. This architectural characteristic arrives at a propitious time, during the 2025-2026 rise of the Inference Economy: as inference surpasses training to become AI’s core computational bottleneck, Cerebras’s unique architectural advantage is unlocked.
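The roofline argument can be sketched in a few lines. The 21 PB/s bandwidth and the 0.6 FLOP/byte ridge point come from the text. Everything else is an assumption for illustration: the decode intensity model (each FP16 weight is read once and used for one multiply-add per sequence, ignoring KV-cache and activation traffic) and the GPU figures, which are hypothetical round numbers for an HBM-class accelerator rather than a quoted specification.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline ridge point: the FLOP/byte intensity where compute and memory limits meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Assumed decode intensity: 2 FLOPs per weight per sequence, one weight read per step."""
    return 2.0 * batch_size / bytes_per_weight


def regime(intensity: float, ridge: float) -> str:
    """Classify a workload against a ridge point."""
    return "compute-bound" if intensity >= ridge else "memory-bound"


# From the text: WSE-3 on-chip SRAM bandwidth and ridge point.
WSE3_BANDWIDTH = 21e15   # bytes/s (21 PB/s)
WSE3_RIDGE = 0.6         # FLOP/byte
# Derived here, not quoted: the peak compute those two figures imply.
wse3_implied_peak = WSE3_RIDGE * WSE3_BANDWIDTH

# Hypothetical HBM-class GPU (round numbers assumed for the sketch).
GPU_PEAK = 1.0e15        # FLOP/s
GPU_BANDWIDTH = 3.0e12   # bytes/s
gpu_ridge = ridge_point(GPU_PEAK, GPU_BANDWIDTH)

batch1 = decode_intensity(batch_size=1)
print(f"Batch-1 decode intensity: {batch1:.1f} FLOP/byte")
print(f"WSE-3 (ridge {WSE3_RIDGE}): {regime(batch1, WSE3_RIDGE)}")
print(f"Hypothetical GPU (ridge {gpu_ridge:.0f}): {regime(batch1, gpu_ridge)}")
```

Under these assumptions a GPU has to batch hundreds of sequences to reach its ridge point, whereas the wafer-scale part is already past its ridge point at a batch size of 1.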

From a business perspective, Cerebras’s revenue grew from $24.6 million in 2022 to $510 million in 2025 [10][13] (a roughly 20x increase over 3 years). The company raised $5.55 billion in its IPO and reached a first-day valuation of $48.8 billion, a historic capital market validation of the WSI technology route. However, the Non-GAAP operating loss widened from $21.8 million to $75.7 million (+247%), cloud gross margin plunged from 68% to 16%, 86% of revenue depends on two related-party UAE customers, and OpenAI received 33.4 million warrants at $0.01/share, which is deep in the money against the $185 IPO price. These figures demand calm amid the euphoria.

Cerebras’s technology route will not replace GPUs, but it has established a clear competitive moat in a specific and increasingly important market segment: large model inference. Key observation points for the next 2-3 years:

  1. The actual delivery cadence and gross margin for the >$20 Billion OpenAI contract.
  2. CS-3 aggregate throughput / TCO data under high-concurrency scenarios (currently missing).
  3. The relative race between SRAM density growth and HBM capacity expansion.
  4. Whether the implementation of Co-Packaged Optics (CPO) technology can shrink the I/O bottleneck by 1-2 orders of magnitude from the current 133,000x gap.
  5. The change in customer concentration after CFIUS clearance allows more U.S.-based enterprise customers to be onboarded.

14. References

  1. Cerebras Official Chip Page. https://www.cerebras.ai/chip
  2. Cerebras WSE-3 Press Release (March 2024). https://www.cerebras.ai/press-release/cerebras-announces-third-generation-wafer-scale-engine
  3. Wikipedia - Cerebras Systems. https://en.wikipedia.org/wiki/Cerebras_Systems
  4. IEEE Spectrum - Cerebras WSE-3: Third Generation Superchip for AI (March 2024). https://spectrum.ieee.org/cerebras-chip-cs3
  5. arXiv - A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems (March 2025). https://arxiv.org/html/2503.11698v1
  6. Peak FLOPS Substack - Breaking down the Cerebras Wafer Scale Engine (April 2026). https://wafer.substack.com/p/breaking-down-the-cerebras-wafer
  7. Introl Blog - Cerebras Wafer-Scale Engine: When to Choose Alternative AI Architecture (April 2026). https://introl.com/blog/cerebras-wafer-scale-engine-cs3-alternative-ai-architecture-guide-2025
  8. TechCrunch - The five technical challenges Cerebras overcame (August 2019). https://techcrunch.com/2019/08/19/the-five-technical-challenges-cerebras-overcame-in-building-the-first-trillion-transistor-chip/
  9. TechCrunch - 600 billion dollar AI chip darling Cerebras almost died early on, burning 8 million dollars a month (May 2026). https://techcrunch.com/2026/05/16/
  10. Mostly Metrics - Cerebras IPO S-1 Breakdown (April 2026). https://www.mostlymetrics.com/p/cerebras-ipo-s1-breakdown
  11. Cerebras Blog - 100x Defect Tolerance: How Cerebras Solved the Yield Problem. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
  12. Cerebras Blog - Cerebras CS-3 vs. Nvidia DGX B200 Blackwell (September 2025). https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-dgx-b200-blackwell
  13. SEC.gov - Cerebras S-1 Registration Statement (April/May 2026). https://www.sec.gov/Archives/edgar/data/2021728/000162828026025762/cerebras-sx1april2026.htm
  14. Forbes - Cerebras, Groq And SambaNova Line Up To Compete With Nvidia (October 2025). https://www.forbes.com/sites/karlfreund/2025/10/21/cerebras-groq-and-sambanova-line-up-to-compete-with-nvidia/
  15. Reuters - Cerebras shares skyrocket in debut (May 2026). https://www.reuters.com/legal/transactional/cerebras-set-debut-stock-market-gripped-by-ai-mania-2026-05-14/
  16. Sacra Research - Cerebras vs Nvidia. https://sacra.com/research/cerebras-vs-nvidia/
  17. TechCrunch - Cerebras raises 5.5 billion dollars, then stock pops 108% (May 2026). https://techcrunch.com/2026/05/14/cerebras-raises-5-5b-kicking-off-2026s-ipo-season-with-a-bang/
  18. Chip Yield Analysis Tool - Cerebras WSE-3 Wafer-Scale Yield Analysis. https://blackyabhishek.github.io/analysis/cerebras_yield_analysis.html
  19. Cerebras Blog - Supporting PyTorch on the Cerebras Wafer-Scale Engine (April 2022). https://www.cerebras.ai/blog/supporting-pytorch-on-the-cerebras-wafer-scale-engine
  20. Cell/Device Journal - Performance, efficiency, and cost analysis of wafer-scale AI (2025). https://www.cell.com/device/fulltext/S2666-9986(25)00147-4
  21. Hot Chips 2024 - Cerebras Wafer-Scale AI Presentation. https://hc2024.hotchips.org/assets/program/conference/day2/72_HC2024.Cerebras.Sean.v03.final.pdf
  22. Cerebras Blog - How Cerebras Solved the Yield Problem. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
  23. Cerebras and AWS Collaboration Press Release (March 2026). https://www.cerebras.ai/press-release/awscollaboration