Cerebras

1. Overview

Wafer-Scale Integration (WSI) is not an original concept of Cerebras. In 1980, Gene Amdahl, the father of the IBM mainframe, founded Trilogy Systems in an attempt to manufacture an entire wafer as a single processor. Trilogy raised $230 million from backers including IBM and Sperry Rand, the largest startup financing in Silicon Valley history at the time. During prototype testing, however, the entire wafer short-circuited upon power-up and burned to a dim red glow, the metal wiring layers delaminated, and the thermal solution failed completely. Compounded by a devastating fab flood, the sudden death of the company president, and a car accident that seriously injured Amdahl himself, Trilogy ended in total failure five years after its founding. In the same period, Texas Instruments, ITT, and the U.S. National Security Agency (NSA) all attempted the WSI route and reached the same conclusion: manufacturing a commercial wafer-scale chip would require 99.99% fabrication yield, something then considered unachievable for at least 100 years.

Cerebras Systems was founded in 2015 by the core SeaMicro team (Andrew Feldman, Gary Lauterbach, Michael James, Sean Lie, Jean-Philippe Fricker). SeaMicro, founded in 2007, was known for its high-density, low-power micro-server designs and was acquired by AMD in 2012 for $334 million. The team had a deep understanding of how to solve system-level bottlenecks with unconventional architectures, a mindset that carried directly into Cerebras’s technical approach to WSI.

Cerebras provided engineering solutions in the following five dimensions that all previous WSI attempts had failed to deliver [6][8]: defect tolerance and yield control, wafer-scale cross-die interconnect (reticle stitching), mechanical compensation for thermal expansion coefficients, a vertical power delivery architecture, and high-flow-rate direct liquid cooling. This transformed WSI, which had been suspended in theoretical discussion since the 1980s, into a mass-producible commercial reality for the first time.

As of 2026, Cerebras has introduced three generations of Wafer-Scale Engine (WSE-1/2/3), building a complete product line from the single-chip CS-3 system to 2,048-node clusters. On May 14, 2026, it completed its IPO on Nasdaq (ticker: CBRS), raising $5.55 billion at $185 per share, with a first-day opening price of $350 and a fully diluted valuation of approximately $48.8 billion [15][17], making it the largest U.S. technology IPO since 2019.

2. Product Evolution: From WSE-1 to WSE-3

2.1 Core Parameter Comparison Across Three Generations of Wafer-Scale Engines

Since launching its first WSE in 2019, Cerebras has driven generational technology leaps on roughly a two-year cycle. The process node evolved from TSMC 16nm to 5nm, transistor count grew from 1.2 trillion to 4 trillion, and core compute power surged from 47 PFLOPS to 125 PFLOPS. Below is a comprehensive physical parameter comparison of the three WSE generations against the NVIDIA H100:

Specification	WSE-1 (2019)	WSE-2 (2021)	WSE-3 (2024)	NVIDIA H100 (Reference)
Process Node	TSMC 16nm	TSMC 7nm	TSMC 5nm	TSMC 4N
Wafer/Die Area	46,225 mm²	46,225 mm²	46,225 mm²	814 mm²
Transistor Count	1.2 Trillion	2.6 Trillion	4.0 Trillion	80 Billion
AI-Optimized Cores	400,000	850,000	900,000	16,896 (CUDA cores)
On-Chip Memory (SRAM)	18 GB	40 GB	44 GB	0.05 GB (L2 Cache)
On-Chip Memory Bandwidth	9 PB/s	20 PB/s	21 PB/s	~0.003 PB/s (HBM3)
On-Chip Interconnect Bandwidth	100 Pb/s	220 Pb/s	214 Pb/s	0.0576 Pb/s (NVLink)
FP16 Peak Compute	47 PFLOPS	75 PFLOPS	125 PFLOPS	~2 PFLOPS
System Product	CS-1	CS-2	CS-3	DGX H100

2.2 Key Changes in Generational Evolution

WSE-1 (2019): The first-generation commercial wafer-scale chip, integrating 400,000 cores and 1.2 trillion transistors on a 16nm process. 18 GB of on-chip SRAM provided 9 PB/s of bandwidth and 47 PFLOPS of compute. The CS-1 system, a 19-inch rack-mountable device, proved the commercial viability of the wafer-scale approach. Initial customers included life sciences institutions like GlaxoSmithKline and AstraZeneca, as well as U.S. national laboratories.

WSE-2 (2021): The process node jumped to 7nm, transistor count more than doubled to 2.6 trillion, and core count increased to 850,000. 40 GB of SRAM with 20 PB/s of bandwidth pushed compute to 75 PFLOPS. The WSE-2 entered the Computer History Museum’s collection under the title “The Biggest Chip In the World.” The CS-2 system supported models of up to 120 trillion parameters for the first time and underpinned the Andromeda (16 interconnected units, 1 ExaFLOP) and Condor Galaxy series supercomputers.

WSE-3 (2024): 5nm process, 4 trillion transistors, 900,000 cores, 44 GB SRAM, 21 PB/s bandwidth, 125 PFLOPS. SRAM growth approached saturation: capacity rose only 10% from WSE-2 to WSE-3, while transistor count grew 54%. The CS-3, housed in a 15U chassis and drawing 23 kW, doubled performance within the same power envelope as the CS-2 [2][4]. It was named one of Time magazine’s Best Inventions of 2024.

2.3 CS-3 System Specifications

Specification Item	Parameter
Processor	WSE-3 (5nm, 4 Trillion Transistors, 900,000 Cores)
Peak Compute	125 PFLOPS (FP16)
On-Chip Memory	44 GB SRAM (21 PB/s Bandwidth)
External Memory Expansion	MemoryX (1.5 TB ~ 1.2 PB)
Cluster Scalability Limit	2,048 Nodes (256 ExaFLOPs)
Cooling	Proprietary Water Cooling (100 L/min, 20 C)
Power Consumption	~23 kW
Form Factor	15U Rack
Model Capacity	Up to 24 Trillion Parameters
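The cluster figure in the table follows directly from the per-system peak. The short sketch below reproduces it, assuming perfect linear scaling across nodes (an idealization, not a measured result); the function name is illustrative.

```python
# Figures taken from the CS-3 specification table.
CS3_PEAK_PFLOPS = 125   # FP16 peak of one CS-3
MAX_NODES = 2048        # stated cluster scalability limit


def cluster_peak_exaflops(nodes: int, per_node_pflops: float = CS3_PEAK_PFLOPS) -> float:
    """Aggregate peak in ExaFLOPs, assuming ideal linear scaling (an assumption)."""
    return nodes * per_node_pflops / 1000.0


print(cluster_peak_exaflops(MAX_NODES))  # 256.0
```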

2.4 Condor Galaxy Supercomputing Network

Cerebras collaborated with Abu Dhabi’s G42 group to deploy the Condor Galaxy (CG) series supercomputers:

System	Announcement Date	Peak Compute	Total Wafer Cores	Location
CG-1	July 2023	4 ExaFLOPs	54 Million	USA
CG-2	November 2023	4 ExaFLOPs	54 Million	USA
CG-3	March 2024 (Groundbreaking)	8 ExaFLOPs	58 Million	Dallas
Full Network Aggregate	-	16 ExaFLOPs	166 Million	Cross-Region

2.5 Scientific Computing Benchmark: Molecular Dynamics Simulation

In collaboration with Sandia National Laboratories, Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory (LANL), and the U.S. National Nuclear Security Administration (NNSA), researchers successfully simulated high-precision interactions between 800,000 atoms on the WSE-2 [3][20]. The simulation computed with a time step of 1 femtosecond (10^-15 seconds), and a single step on the WSE-2 took only microseconds. Its speed significantly surpassed Frontier, then the world’s top supercomputer built on traditional nodes, demonstrating the wafer-scale architecture’s innate hardware suitability for the extreme local real-time feedback demands of simulating strongly coupled physical systems.

3. Breaking the Reticle Limit: Scribe Line Stitching and Wafer-Scale Lithography

3.1 Physical Constraint: The Reticle Limit

The core limitation in semiconductor lithography comes from the Field of View of the optical lens. For current mainstream Deep Ultraviolet (DUV) and Extreme Ultraviolet (EUV) lithography machines, the maximum pattern area printable in a single exposure is limited by the physical size of the reticle (photomask) — typically 26 mm x 33 mm, approximately 858 mm². This means that, regardless of design, the physical area of any single-exposure die cannot exceed this limit. Traditional chip manufacturers use a Step-and-Repeat process to expose the same pattern multiple times across the wafer, subsequently mechanically cutting along scribe lines to divide the wafer into dozens or hundreds of individual chips.

3.2 Reticle Stitching Process Details

TSMC executes a standard step-and-repeat lithography process on a 300 mm wafer, printing a total of 84 identical dies arranged in a 12 x 7 grid. Each die is approximately 550 mm² (46,225 mm² / 84), which sits comfortably inside the ~858 mm² reticle limit. Unlike traditional processes, after standard exposure is complete, Cerebras adds extra lithography steps to fabricate miniature metal wires spanning the scribe line regions. These wires, less than 1 mm in length and running in the mid-to-upper levels of the on-chip metallization stack, physically connect the on-chip interconnect network (2D Mesh Fabric) of all 84 dies into a single, continuous plane.

This cross-die interconnect system comprises over 1 million wires. The protocol stack layer includes built-in redundancy mechanisms for defective wires (spare wires + automatic rerouting). From the compiler and software perspective, the boundaries of these 84 dies do not exist — the entire wafer presents as a unified, continuous 2D Mesh compute plane.

Technical Cost: Reticle stitching increases the number of photomasks and production steps, making the wafer manufacturing cost higher than that of a standard GPU wafer. However, Cerebras’s argument is that this additional cost is offset by the elimination of off-chip packaging, inter-chip connectivity, and system integration costs enabled by wafer-scale integration.

3.3 Core Micro-Design: The Physical Significance of a 0.05 mm² Core

A single AI-optimized core in the WSE-3 occupies an area of 0.05 mm² — approximately 1/120th the area of a Streaming Multiprocessor (SM, ~6 mm²) in an NVIDIA H100. This extremely small core size has multiple physical implications:

Minimized Defect Cost: At a given defect density, a single defect renders only 0.05 mm² of silicon area non-functional, instead of 6 mm² — a 120x reduction in the economic cost of defects.
Fine-Grained Redundancy: Within a fixed silicon area, a far greater number of physical cores than functionally required can be integrated, providing ample redundant spares.
Short Interconnect Latency: The physical distance between cores is on the order of tens of micrometers, with signal propagation delay being just 1 clock cycle.

Within this tiny 0.05 mm² space, the silicon area allocation is roughly: about 50% for a 48 KB single-cycle SRAM, and the remaining 50% for general-purpose tensor and sparse algebra computation logic composed of approximately 110,000 standard gates. A single core’s peak power consumption at a frequency of 1.1 GHz is only 30 mW.

4. Defect Tolerance and Yield

4.1 The Physical Reality of Defect Density

The typical defect density for TSMC’s 5nm process is approximately 0.001 defects per mm² (data for mature nodes). On the 46,225 mm² WSE-3, this density translates to roughly 46 random physical defects per wafer. In traditional chip manufacturing, any single one of these defects falling within the active area of a chip renders the entire die non-functional — this is the fundamental reason why, for 75 years, chips have been made smaller, not larger.
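To make the defect arithmetic concrete, the sketch below computes the expected defect count and the chance that a monolithic die of a given area has zero defects. The Poisson yield model is a common textbook simplification and is an assumption here, not something stated in the source; the areas and defect density are the ones quoted in this document.

```python
import math

DEFECT_DENSITY = 0.001   # defects per mm^2, as quoted in the text
WSE3_AREA_MM2 = 46_225   # full wafer-scale engine
H100_AREA_MM2 = 814      # reference GPU die from the comparison table


def expected_defects(area_mm2: float, d0: float = DEFECT_DENSITY) -> float:
    """Mean number of random defects landing on the given area."""
    return d0 * area_mm2


def poisson_yield(area_mm2: float, d0: float = DEFECT_DENSITY) -> float:
    """Probability of zero defects on the area (Poisson model, assumed)."""
    return math.exp(-expected_defects(area_mm2, d0))


print(round(expected_defects(WSE3_AREA_MM2), 1))  # 46.2
print(round(poisson_yield(H100_AREA_MM2), 3))     # 0.443
print(poisson_yield(WSE3_AREA_MM2))               # on the order of 1e-20
```

Under this simplified model, a defect-free wafer-scale die is effectively impossible, which is why the design has to tolerate defects rather than avoid them.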

4.2 Cerebras’s Solution: A Three-Layer Mechanism for 100x Defect Tolerance

Layer 1: Core Miniaturization. A single core area of 0.05 mm² versus the ~6 mm² of an H100 SM creates an asymmetry in defect cost. At the identical defect density of 0.001 defects/mm²: one defect on WSE-3 has a 50% probability of landing inside a core area, with an expected loss of 0.025 mm² of silicon area; one defect on H100 has a 99.8% probability of landing inside an SM area, with an expected loss of approximately 3 mm² of silicon area. The ratio of the two expected losses is roughly 120x (3 / 0.025).

Layer 2: Physical Redundancy. WSE-3 physically integrates 970,000 cores on the wafer but nominally enables only 900,000. The 70,000 extra cores (approximately 7.2% physical redundancy) provide ample spare capacity.

Layer 3: Fail-in-Place Resilient Routing. During the chip’s power-on initialization phase, test logic identifies the locations of all defective cores. The on-chip reconfigurable interconnect network then automatically bypasses the failed cores, remapping neighboring healthy cores to the corresponding positions in the logical grid. This process is completed entirely automatically in hardware and is transparent to the software layer.

The net effect of this three-layer mechanism: The effective active silicon area ratio of the WSE-3 reaches approximately 93% (900,000 / 970,000), achieving a usable yield at commercial scale comparable to diced chip processes. Cerebras’s core insight is that solving the yield problem does not depend on reducing defects, but on making the economic cost of each defect approach zero.
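The remapping idea can be illustrated with a toy model. The sketch below is purely illustrative: the real mechanism is implemented in hardware on a 2D mesh, whereas this hypothetical `build_logical_map` helper works on a single row of cores and simply skips the defective ones.

```python
def build_logical_map(physical_count, defective, logical_count):
    """Map logical core indices to healthy physical cores, skipping defects.

    Returns a list where entry i is the physical index backing logical core i.
    """
    bad = set(defective)
    healthy = [p for p in range(physical_count) if p not in bad]
    if len(healthy) < logical_count:
        raise ValueError("not enough healthy cores to fill the logical grid")
    return healthy[:logical_count]


# Toy row: 10 physical cores, 8 exposed to software, cores 2 and 5 defective.
mapping = build_logical_map(10, defective=[2, 5], logical_count=8)
print(mapping)  # [0, 1, 3, 4, 6, 7, 8, 9]

# Wafer-level enabled-to-physical ratio from the text.
print(round(900_000 / 970_000, 3))  # 0.928
```

Software only ever sees the logical indices, so the two defective cores in this example are invisible to it as long as enough spares exist.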

5. Micro-Core Architecture and On-Chip Dataflow Network

5.1 Internal Structure of a Compute Core

Each WSE core internally contains:

48 KB single-cycle SRAM, using an 8-Bank split architecture (6 KB per Bank, 32-bit width), supporting simultaneous conflict-free access of 2x 64-bit reads + 1x 64-bit write per clock cycle.
256 Bytes of software-managed cache, specifically for storing high-frequency changing data structures like accumulators.
Compute logic composed of 110,000 standard gates, supporting tensor multiply-accumulate and sparse matrix operations.
Native sparse triggering in the instruction set: upon detecting an input weight is zero, it automatically skips the multiply-accumulate operation, yielding a several-fold effective speedup when processing highly sparse large language models.
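The zero-skipping behavior described in the last item can be sketched in software. This is a behavioral illustration only, not the hardware implementation; `sparse_dot` is a hypothetical helper that counts how many multiply-accumulates are actually executed.

```python
def sparse_dot(weights, activations):
    """Dot product that skips zero weights; returns (result, macs_executed)."""
    acc = 0.0
    macs = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # zero weight: no multiply-accumulate is issued
        acc += w * a
        macs += 1
    return acc, macs


w = [0.5, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
result, macs = sparse_dot(w, x)
print(result, macs)  # 12.5 3
```

With 5 of 8 weights equal to zero, only 3 of 8 operations run, which is the source of the effective speedup on sparse models.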

5.2 On-Chip Interconnect Network Architecture

The WSE builds a high-speed interconnect network based on a 2D Mesh topology. Each core integrates a 5-port structural router (East, West, South, North, Local), supporting bidirectional 32-bit single-cycle data transfer. Each physical transmission packet consists of 16-bit compute data + 16-bit index data, perfectly matching the coordinate addressing requirements of sparse matrix computation.

Network communication is divided at the hardware level into 24 independently configurable static routing colors. Each color has hardware-isolated dedicated buffer queues and shares the physical links via time multiplexing for non-blocking transmission. The on-chip fabric natively supports hardware-level single-cycle broadcast and multicast. Because the physical wires between adjacent cores are only tens of micrometers long, core-to-core signal latency is just 1 clock cycle (~0.9 ns at 1.1 GHz).

Fundamental Architectural Difference from GPU: The WSE uses a Dataflow Architecture — computation is driven by the arrival of data. 32-bit wavelet messages travel through the 2D grid; the wavelet’s 5-bit color tag determines the routing path and trigger task. When a wavelet arrives at a specific color channel, the bound task is launched for execution. If the weight is zero, no wavelet is emitted, achieving unstructured sparsity acceleration. In contrast, GPUs use a control flow architecture (SIMT/Warp) — execution is driven by the program counter; all 32 threads in the same warp execute the same instruction and cannot skip zero-value computations.
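A small simulation can make the dataflow trigger model more tangible. The sketch below assumes a particular bit layout (index in the upper 16 bits, data in the lower 16) and uses hypothetical class and function names; neither the layout nor the names come from Cerebras documentation. It shows a task bound to a color being launched by each arriving wavelet, with zero values emitting nothing.

```python
NUM_COLORS = 24  # routing colors stated in the text


def pack_wavelet(data16, index16):
    """Pack 16-bit data and a 16-bit index into one 32-bit word (layout assumed)."""
    return ((index16 & 0xFFFF) << 16) | (data16 & 0xFFFF)


def unpack_wavelet(wavelet):
    """Return (data16, index16) from a 32-bit wavelet."""
    return wavelet & 0xFFFF, (wavelet >> 16) & 0xFFFF


class Core:
    """Toy core: tasks are bound to colors and run when a wavelet arrives."""

    def __init__(self):
        self.tasks = {}
        self.acc = {}

    def bind(self, color, task):
        if not 0 <= color < NUM_COLORS:
            raise ValueError("color out of range")
        self.tasks[color] = task

    def receive(self, color, wavelet):
        data, index = unpack_wavelet(wavelet)
        self.tasks[color](self, data, index)  # data arrival triggers the task


def accumulate(core, data, index):
    core.acc[index] = core.acc.get(index, 0) + data


core = Core()
core.bind(3, accumulate)
for value, idx in [(7, 0), (0, 1), (5, 0)]:
    if value != 0:  # a zero value emits no wavelet, so no work is triggered
        core.receive(3, pack_wavelet(value, idx))
print(core.acc)  # {0: 12}
```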

5.3 On-Chip Memory Hierarchy

Level	Medium	Capacity	Aggregate Bandwidth	Latency	Physical Location
L0 - Register	Core-Private	256 B	5.3 TB/s (Single Core Peak)	1 cycle	Inside Core
L1 - SRAM	Core-Private	48 KB x 900,000 = 44 GB	21 PB/s (Full Chip Aggregate)	1 cycle	Inside Core
L2 - MemoryX	DRAM + Flash	1.5 TB ~ 1.2 PB	Proprietary Protocol	High	External Cabinet
L3 - SwarmX	Switch Network	Cluster-Level	Broadcast/Reduce Hardware Acceleration	Topology-Dependent	Cluster Interconnect

The 21 PB/s aggregate bandwidth of the 44 GB on-chip SRAM in the WSE-3 fundamentally changes the compute economics. Compared to the 3 TB/s of H100’s HBM3, the difference is a factor of 7,000. More critically, SRAM bandwidth scales linearly with capacity (each bank can be read in parallel by adjacent compute units), whereas HBM bandwidth is limited by the number of physical channels — this is an architectural difference, not one that can be bridged by process advancements alone.
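The aggregate numbers can be cross-checked from the per-core figures given earlier. The sketch below assumes every core performs two 64-bit reads and one 64-bit write each cycle at 1.1 GHz, which gives an upper bound somewhat above the quoted 21 PB/s; the gap may reflect a different clock or counting convention, which the source does not specify.

```python
CORES = 900_000
CLOCK_HZ = 1.1e9
BYTES_PER_CYCLE = (2 * 64 + 1 * 64) // 8   # 2 reads + 1 write of 64 bits = 24 bytes
SRAM_PER_CORE_BYTES = 48 * 1024            # 48 KB per core, taken as KiB here

per_core_gb_s = BYTES_PER_CYCLE * CLOCK_HZ / 1e9
aggregate_pb_s = BYTES_PER_CYCLE * CLOCK_HZ * CORES / 1e15
total_sram_gb = SRAM_PER_CORE_BYTES * CORES / 1e9
ratio_vs_hbm = 21e15 / 3e12                # quoted 21 PB/s versus 3 TB/s of HBM3

print(round(per_core_gb_s, 1))    # 26.4
print(round(aggregate_pb_s, 2))   # 23.76
print(round(total_sram_gb, 1))    # 44.2
print(round(ratio_vs_hbm))        # 7000
```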

6. Power Delivery, Thermal Management, and Thermal Expansion Mechanical Compensation Engineering

6.1 Power Delivery Challenge: Voltage Consistency at 23 kW

The WSE-3 has a full-load rated power consumption of 23 kW at an operating voltage of approximately 0.8-0.9 V (sub-volt level), which requires a continuous current of roughly 25,600 to 28,750 A (I = P / V). In a traditional horizontal two-dimensional power delivery architecture, power is routed from the chip edge through lateral bus bars on the PCB. Because of the physical impedance of the metal wiring, a current approaching 30 kA crossing a 215 mm wafer would generate a catastrophic IR drop: theoretically, the voltage drop from edge to center could reach 9.6 V, more than ten times the chip’s sub-1 V operating voltage, making it impossible to power cores in the central region.
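The current requirement is a direct application of I = P / V. The sketch below evaluates it at both ends of the stated voltage range, shows the voltage at which 23 kW would correspond to a full 30 kA, and backs out the lumped lateral resistance implied by a 9.6 V drop at 30 kA. The lumped-resistance view is a simplification introduced here for illustration.

```python
POWER_W = 23_000  # full-load rated power from the text


def supply_current(power_w: float, volts: float) -> float:
    """Current in amperes needed to deliver power_w at the given voltage."""
    return power_w / volts


for volts in (0.9, 0.8):
    print(volts, round(supply_current(POWER_W, volts)))
# 0.9 25556
# 0.8 28750

# Voltage at which 23 kW draws a full 30 kA.
print(round(POWER_W / 30_000, 3))          # 0.767

# Lumped resistance implied by a 9.6 V drop at 30 kA, in milliohms (simplified).
print(round(9.6 / 30_000 * 1e3, 2))        # 0.32
```

Even a fraction of a milliohm of lateral resistance is enough to consume many times the supply voltage at these currents, which motivates delivering power vertically.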

Solution: 3D Vertical Power Delivery. Cerebras placed a custom multi-layer high-density power distribution PCB directly behind the wafer, embedding over 300 high-frequency step-down Voltage Regulator Modules (VRMs). Current is projected perpendicularly to the wafer surface directly onto micro-electrical contacts on the back of each core, over a physical distance of only a few millimeters. Each of the 84 die areas has its voltage regulated independently, completely eliminating lateral IR Drop. The entire power delivery network is encapsulated within a four-layer physical sandwich called the Engine Block: Cold Plate, Wafer, Custom Compliant Connector, Power PCB.

6.2 Coefficient of Thermal Expansion Mismatch and Mechanical Compensation

The core packaging challenge faced by the system originates from the Coefficient of Thermal Expansion (CTE) mismatch of heterogeneous materials:

Material	CTE (ppm/C)	Corner Displacement under 65 C Rise (215 mm x 215 mm)
Silicon	2.6	~36 µm
FR-4 PCB	15 (Lateral)	~210 µm
Copper (Cold Plate)	17	~238 µm

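The expansion of the PCB under a 65 C temperature rise is roughly 5.8 times that of silicon. The table values are computed over the full 215 mm span, giving a PCB-versus-silicon mismatch of about 174 µm edge to edge; measured from the wafer center to a corner (a half-diagonal of about 152 mm), the relative displacement is about 122 µm. For traditional packaging (BGA, flip-chip, wire bonding), a 122 µm relative corner displacement already exceeds the failure threshold by a factor of 5-7.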
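The displacement figures follow from the linear expansion relation ΔL = α · L · ΔT. The sketch below reproduces the table values over the full 215 mm span and also evaluates the PCB-versus-silicon mismatch over the center-to-corner half-diagonal. Which reference length applies to each quoted figure is an interpretation, not something the source spells out.

```python
import math

SPAN_MM = 215.0      # wafer edge length
DELTA_T_C = 65.0     # temperature rise
CTE_PPM = {"silicon": 2.6, "fr4": 15.0, "copper": 17.0}


def expansion_um(cte_ppm: float, length_mm: float, delta_t_c: float) -> float:
    """Linear thermal expansion in micrometers: alpha * L * dT."""
    return cte_ppm * 1e-6 * (length_mm * 1e3) * delta_t_c


full_span = {m: expansion_um(a, SPAN_MM, DELTA_T_C) for m, a in CTE_PPM.items()}
for material, um in full_span.items():
    print(material, round(um))
# silicon 36
# fr4 210
# copper 238

# PCB vs. silicon mismatch over the center-to-corner half-diagonal.
half_diagonal_mm = SPAN_MM * math.sqrt(2) / 2
mismatch_um = expansion_um(CTE_PPM["fr4"] - CTE_PPM["silicon"], half_diagonal_mm, DELTA_T_C)
print(round(half_diagonal_mm), round(mismatch_um))  # 152 123
```

The center-to-corner result of about 122.5 µm corresponds to the 122 µm relative corner displacement quoted in the text.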

Solution: Co-founder Jean-Philippe Fricker led the design of a custom Compliant Elastomeric Connector. This connector, sandwiched between the wafer and the PCB, maintains good electrical conductivity in the vertical direction while possessing high physical compliance and deformation resilience in the horizontal shear direction. When temperature differences cause the PCB to expand more than the silicon, the elastic connector layer absorbs all shear stress through microscopic physical shear deformation, ensuring contact reliability for hundreds of thousands of power and signal pins.

Additionally, a dynamic Ambulating Thermal Interface (ATI) material is embedded between the cold plate and the back of the wafer. Composed of a high thermal conductivity material and a physical friction-reducing material laminated together, it allows the water-cooled copper plate to undergo micron-level lossless horizontal sliding against the silicon surface during thermal deformation, preventing stress transfer that could physically fracture the silicon wafer.

No off-the-shelf automated equipment could precisely handle such a large-area, fragile, heterogeneous 3D layered assembly. Cerebras designed and built dedicated high-precision alignment and pressure assembly machines from scratch to close the mechanical structure loop for wafer-scale components.

6.3 Liquid Cooling System

The system employs Direct-to-Chip Liquid Cooling. Dual-redundant industrial high-pressure pumps inject cooling water at 20 +/- 2 C, at a flow rate of 100 +/- 10 L/min, into a brass manifold cold plate conforming to the wafer surface. The interior of the cold plate is machined with micro-fin channels to maximize heat exchange surface area.

At the data center level, row-based and in-rack high-precision fluid manifold controls are deployed, with digital monitoring of flow and pressure to eliminate stagnant zones. The CSoft software layer runs dynamic duty-cycle dummy operations (Power Ramp Smoothing via Dummy Operations) when no computational workload is active, smoothing power transients to stay within electrical safety boundaries and preventing potential physical damage to the wafer-scale system from instantaneous, drastic voltage fluctuations.

7. CSoft Compilation Pipeline and Execution Modes

7.1 Compiler Hierarchy

The core of the Cerebras CSoft software platform is the Cerebras Graph Compiler (CGC), responsible for losslessly mapping a PyTorch/TensorFlow computational graph onto the physical grid of 900,000 cores. The compilation pipeline follows a stepwise Lowering logic:

PyTorch Model
Lazy Tensor Backend (ATen Operator Graph Capture)
XLA HLO (High-Level Optimization Custom Calls)
CIRH (Cerebras IR High, an MLIR dialect, full-graph level rewrite passes)
Operator Deep Fusion, Constant Folding, Common Subexpression Elimination, Dead Code Pruning
Pattern Matching against a Pre-built Library of Hand-Tuned High-Performance Kernels
[Match Success] Directly Generate Optimized Instructions
[Match Failure] CLAIR/LAIR (Low-Level Linear Algebra IR)
AutoGen Automated Kernel Compiler
Polyhedral Space Transformation Mathematical Optimization
2D Core Topology-Aware Placement
8-Bank SRAM Allocation
Final Physical Instruction Machine Code

The AutoGen kernel compiler supports four strategies: default, disabled, medium, and aggressive, providing adaptive kernel generality for customized development.

Key Constraint: The computational graph must be a Static Graph — dynamic shapes or data-dependent branching are not supported. The routing table is fixed once at compile time for all 900,000 cores.

7.2 Two Execution Modes

Layer-Pipelined Mode - Model Fully Resident On-Chip:

flowchart LR
    Input[Input Data Stream] --> L1["WSE Partition 1<br/>Layer 1"]
    L1 --> L2["WSE Partition 2<br/>Layer 2"]
    L2 --> Ldots[...]
    Ldots --> LN["WSE Partition N<br/>Layer N"]
    LN --> Output[Output]

Characteristics: Model parameters are fully resident on-chip. Multiple micro-batches run concurrently, spatially interleaved across the wafer. The compiler must solve a VLSI floorplanning problem — one Cerebras proved to be NP-hard at ISPD 2020. Applicable for models whose parameters fit within the 44 GB SRAM.

Weight Streaming Mode - Current Default:

flowchart LR
    subgraph MemoryX[External MemoryX Storage]
        W1[Layer 1 Weights]
        W2[Layer 2 Weights]
        WN[Layer N Weights]
    end
    W1 --> WS["WSE Full Chip<br/>900,000 Cores<br/>Single Layer Computation"]
    W2 --> WS
    WN --> WS
    WS --> Grad[Gradient Return]
    Grad --> MemoryX

Characteristics: The entire wafer processes a single layer at a time. Weights reside in external MemoryX (DRAM + Flash, up to 1.2 PB) and are streamed to the wafer layer-by-layer. All 900,000 cores process the same layer; activations remain on-chip, and weights are discarded after computation. Scaling to 2,048 systems requires only changing a single flag.
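The control flow of weight streaming can be sketched as a loop in which only one layer's weights are resident at a time while activations persist between layers. This is a toy forward pass in plain Python with hypothetical names (`memoryx`, `stream_weights`); it does not use or represent the Cerebras software stack.

```python
def stream_weights(store):
    """Yield (layer_name, weights) one layer at a time from the external store."""
    for name in store:
        yield name, store[name]


def matvec(matrix, vector):
    """Dense matrix-vector product on plain lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def forward(store, inputs):
    activations = inputs                      # activations stay resident throughout
    for _name, weights in stream_weights(store):
        activations = matvec(weights, activations)
        activations = [max(0.0, a) for a in activations]  # ReLU
        del weights                           # drop the local copy; the store keeps the master
    return activations


# Stand-in for the external weight store (two tiny layers).
memoryx = {
    "layer1": [[1.0, -1.0], [0.5, 0.5]],
    "layer2": [[2.0, 0.0], [0.0, -1.0]],
}
print(forward(memoryx, [3.0, 1.0]))  # [4.0, 0.0]
```

The point of the structure is that resident memory is bounded by one layer plus activations, independent of total model size.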

7.3 The Order-of-Magnitude Difference in Software Complexity

Training a 175 billion parameter large model on a GPU cluster typically requires ~20,000 lines of distributed training code (a combination of Tensor Parallelism + Pipeline Parallelism + FSDP + DeepSpeed + Megatron-LM). Cerebras claims an equivalent scale requires just 565 lines of PyTorch code, and that the software complexity of training a 1 trillion parameter model is comparable to training a 1 billion parameter model on GPUs — a frequently underestimated competitive advantage.

7.4 Developer Interfaces

AI Inference Users: OpenAI-compatible API, zero learning curve.
Model Training Users: Standard PyTorch / TensorFlow frameworks; CSoft handles the low-level details.
HPC Developers: CSL SDK — a Zig-based DSL allowing programming of individual cores, manual configuration of routing tables, and adaptation of code and data within the 48 KB memory. There are no thread concepts, no shared memory, no kernel launches, but also no need to handle synchronization or race conditions.

8. Wafer-Scale Supercomputer Cluster Architecture

8.1 Cluster Components

Component	Function	Specification
CS-3	Single System Compute Unit	15U, 23 kW, 125 PFLOPS
MemoryX	External Weight Storage Node	1.5 TB ~ 1.2 PB
SwarmX	Cluster Switching Network	Hardware-Level Broadcast + Gradient Reduce/Sum
CSL	Cluster Interconnect Topology	Up to 2,048 Nodes, 256 ExaFLOPs
AI400X2	Parallel File Storage	90+ GB/s Sustained Bandwidth, 3M+ IOPS

8.2 DARPA Co-Packaged Optics Project

Under DARPA funding, Cerebras is collaborating with Ranovus to develop wafer-scale Co-Packaged Optics (CPO). The goal is to directly mount Ranovus’s optical fiber transceivers onto the edge of the wafer, replacing traditional electrical off-chip interconnects with a multi-wavelength, multi-mode fiber optic network. This solution can provide over 100x the data throughput capacity of conventional CPO solutions while drastically reducing the power consumption of cluster-level parameter transfer.

9. Wafer-Scale Architecture vs. NVIDIA GPU Clusters

9.1 System-Level Physical Specification Comparison

Evaluation Dimension	Cerebras CS-3	NVIDIA B200 (Single Card)	NVIDIA DGX B200 (8-Card Node)	NVIDIA GB200 NVL72 (Full Rack)
Core Physical Form	Single Wafer Integrated (15U Chassis)	Dual-Die Bridged (SXM Module)	8-Card Parallel (10U Chassis)	72-Card High-Density Parallel (Full Rack)
FP16 Peak Compute	125 PFLOPS	4.4 PFLOPS	36 PFLOPS	360 PFLOPS
On-Board Memory Capacity	44 GB SRAM (On-Chip)	192 GB HBM3e	1.5 TB HBM3e	13.5 TB HBM3e
Memory Access Bandwidth	21,000 TB/s (21 PB/s)	8.0 TB/s	64 TB/s	576 TB/s
Inter-Chip Communication Performance	On-Chip Metal Traces, Zero External Loss	SXM Physical Socket	NVSwitch On-Board Routing	9 Sets High-Speed Copper + Optical Switch
Max Rated Power Consumption	~23 kW	~1,200 W	~14.3 kW	~120 kW
Rack Space	15U	Single Slot	10U	42U
LLM Training Programming Complexity	Pure Data Parallelism, ~565 Lines of Code	Not Applicable	Tensor/Pipeline/FSDP Combination	Extremely Complex Network Topology Configuration

9.2 Training Deployment: Parameter Scale vs. Physical Requirements

Model Parameter Scale	GPU Cluster Requirement (B200)	Cerebras CS-3 Requirement
100 Billion (100B)	>=12 B200 GPUs + NVSwitch + InfiniBand	1x CS-3 + 2.4 TB MemoryX
1 Trillion (1T)	Hundreds of B200 GPUs + Fiber Interconnect Cluster	1x CS-3 + 1.2 PB MemoryX
10 Trillion (10T)	1,000+ B200 Servers	1x CS-3 + 1.2 PB MemoryX

9.3 Inference: Roofline Model Analysis

The arithmetic intensity formula for the LLM token generation (decode) phase is: Arithmetic Intensity = FLOPs / Bytes, approximately 1 FLOP/byte. The Ridge Point = Peak FLOPS / Peak Memory Bandwidth.

Chip	FP16 Peak	Memory Bandwidth	Ridge Point	Decode Arithmetic Intensity	State
H100	989 TFLOPS	3.35 TB/s	295 FLOP/byte	1 FLOP/byte	99.7% Compute Units Idle
B200	4.4 PFLOPS	8 TB/s	550 FLOP/byte	1 FLOP/byte	>99.8% Idle
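WSE-3	125 PFLOPS	21 PB/s	~6 FLOP/byte	1 FLOP/byte	~83% Idle (Near Ridge Point)

Using the 125 PFLOPS FP16 peak quoted throughout this document, the WSE-3 ridge point is about 6 FLOP/byte, so decode at a batch size of 1 (Batch-1) is still below the ridge but within an order of magnitude of it. The WSE-3 is the only chip in the table for which this is true: a single user request can reach roughly one sixth of peak compute, versus well under 1% on the GPUs, which must rely on large batch sizes to amortize the overhead of reading weights from HBM.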
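The ridge-point formula can be applied mechanically. The sketch below uses the GPU rows of the table and a simple roofline model in which attainable performance is capped at min(1, intensity / ridge) of peak; the function names are illustrative.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a chip becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_fraction(intensity: float, ridge: float) -> float:
    """Fraction of peak compute reachable at a given arithmetic intensity."""
    return min(1.0, intensity / ridge)


DECODE_INTENSITY = 1.0  # FLOP/byte, as stated for batch-1 decode

chips = {
    "H100": (989e12, 3.35e12),
    "B200": (4.4e15, 8e12),
}
for name, (peak, bandwidth) in chips.items():
    ridge = ridge_point(peak, bandwidth)
    idle = 1.0 - attainable_fraction(DECODE_INTENSITY, ridge)
    print(name, round(ridge), f"{idle:.1%}")
# H100 295 99.7%
# B200 550 99.8%
```

The same two functions can be applied to any other peak and bandwidth pair to see how far a given workload sits from the ridge.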

Measured Inference Performance Comparison (Independent Benchmarks):

Model	Cerebras CS-3	NVIDIA DGX B200	Advantage Factor
Llama 4 Maverick (400B)	2,500+ tok/s/user[7][12]	~1,000 tok/s/user	2.5x
gpt-oss-120B (10 Concurrency)	2,700+ tok/s	580 tok/s	4.7x
DeepSeek R1 70B	1,600 tok/s	-	-
Perplexity Sonar	1,200 tok/s	-	-

9.4 Inference TCO Comparison

Cerebras CS-3 pricing is approximately $2-3 million per node. DGX B200 (8x B200) is around $300,000. However, Cerebras claims that under an equivalent online token generation load, the CS-3’s comprehensive TCO (hardware CapEx + power OpEx) is approximately 32% lower than DGX B200, while delivering 21x faster single-token interaction speed. The core of this pricing logic is that a single CS-3 is equivalent in inference throughput to multiple DGX B200s and eliminates the significant hidden costs of InfiniBand networks, multi-rack space, and cluster management software.

10. I/O Bottleneck

10.1 I/O Bottleneck: A 133,000x Gap

On-Chip SRAM Bandwidth: 21 PB/s. Off-Chip MemoryX/Network I/O: approximately 150-200 GB/s. Gap: roughly 105,000x to 140,000x across that range (approximately 133,000x at about 158 GB/s).

For models that fit within the 44 GB SRAM, this is not an issue — all data circulates on-chip. However, mainstream large language models are rapidly surpassing this capacity (Llama 4 Maverick 400B requires 800 GB of weights in FP16). In Weight Streaming mode, the speed at which weights for each layer are streamed from MemoryX is bottlenecked by the off-chip bandwidth, making I/O the bottleneck.
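The size of the gap and of the capacity shortfall can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this section; the 2 bytes per parameter corresponds to FP16.

```python
SRAM_BANDWIDTH = 21e15          # bytes/s, on-chip aggregate
SRAM_CAPACITY_GB = 44

# Bandwidth gap at both ends of the quoted off-chip I/O range.
gaps = {io: SRAM_BANDWIDTH / io for io in (150e9, 200e9)}
for io, gap in gaps.items():
    print(int(io / 1e9), "GB/s ->", round(gap))
# 150 GB/s -> 140000
# 200 GB/s -> 105000

# FP16 weight footprint of a 400B-parameter model versus on-chip SRAM.
weights_gb = 400e9 * 2 / 1e9
print(weights_gb, round(weights_gb / SRAM_CAPACITY_GB, 1))  # 800.0 18.2
```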

The collaboration model with AWS — where Trainium handles Prefill and Cerebras handles Decode — is essentially an implicit acknowledgment of the chip’s insufficient efficiency during the compute-bound prefill stage.

10.2 Stagnation in SRAM Density Growth

The area of a high-density SRAM bitcell has essentially stalled at approximately 0.021 µm² from the 5nm node down to 3nm and even 2nm nodes. SRAM scaling faces physical limits (leakage and stability constraints of the 6T cell). Concurrently, HBM is progressing from HBM3 (~5 Gb/s/pin) towards HBM4 (8+ Gb/s/pin), with stack layers increasing from 12 to 16, and capacity heading towards 1+ TB by 2028. Cerebras’s SRAM capacity advantage will face a structural challenge within 2-3 generations.

10.3 The Undisclosed Throughput Crossover Point

Cerebras’s headline-grabbing tok/s figures are single-user speeds. GPUs boost aggregate throughput by batch processing users — serving multiple concurrent requests from the same weight read: at a batch size of 10-20, a single H100’s aggregate throughput may already match a single CS-3; at a batch size of 128+, a DGX H100 system can generate thousands of aggregate tok/s at significantly lower hardware cost. Cerebras has never published its aggregate throughput figures under high-concurrency scenarios. This is the most significant missing data point in publicly available materials.

11. Business Model and Capital Landscape

11.1 Revenue Growth Trajectory

Year	Revenue	YoY Growth	Key Drivers
2022	$24.6 Million	-	Early deployments for life sciences customers
2023	$78.7 Million	220%	CS-2 shipments began
2024	$290.3 Million	269%	CG-1/2 delivery + G42 prepayment momentum
2025	$510 Million	76%	Large MBZUAI order + Inference cloud launch

Quarterly trend acceleration is notable: Q1 2025 was $99.5 Million, and Q4 2025 reached $171.4 Million (an annualized run rate of ~$686 Million).
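The growth rates and run rate in this section can be reproduced from the revenue figures. The sketch below is plain arithmetic on the numbers in the table, in millions of dollars.

```python
revenue_musd = {2022: 24.6, 2023: 78.7, 2024: 290.3, 2025: 510.0}


def yoy_growth_pct(previous: float, current: float) -> float:
    """Year-over-year growth in percent."""
    return (current / previous - 1.0) * 100.0


years = sorted(revenue_musd)
growth = {
    year: yoy_growth_pct(revenue_musd[prev], revenue_musd[year])
    for prev, year in zip(years, years[1:])
}
for year, pct in growth.items():
    print(year, f"{pct:.0f}%")
# 2023 220%
# 2024 269%
# 2025 76%

# Annualized run rate from Q4 2025 revenue.
q4_2025_musd = 171.4
print(round(q4_2025_musd * 4, 1))  # 685.6
```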

11.2 Deconstructing GAAP Profit: Paper Gains vs. Substantive Losses

Metric	2024	2025	Change
GAAP Net Income (Loss)	($481.6 Million)	$237.8 Million	Turned Profitable
Includes: One-Time Non-Cash Gain from Extinguishment of Forward Contract Liabilities	-	$363.3 Million	Paper Adjustment
Stock-Based Compensation (SBC)	-	$49.8 Million	Non-Cash Expense (Added Back)
Non-GAAP Operating Net Loss	($21.8 Million)	($75.7 Million)	Loss Widened by 247%
Operating Cash Flow	$452 Million	($10 Million)	Turned from Positive to Negative

Key Insight: The $237.8 million GAAP net profit in 2025 is a paper profit, driven by a one-time non-cash gain resulting from the recapitalization of forward purchase contract liabilities with G42. Excluding this, the Non-GAAP operating loss widened from $21.8 million to $75.7 million — a 247% increase in losses. The reason operating cash flow turned from positive to negative is that 2024 included $640.3 million in customer prepayments from G42 (recorded positively in operating cash flow), whereas in 2025, positive prepayments decreased and capacity previously sold was being delivered against.
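The bridge from GAAP net income to the Non-GAAP figure can be verified from the table. The sketch below assumes the usual convention of removing the one-time non-cash gain and adding back stock-based compensation; under that assumption the numbers reconcile to the stated loss. Values are in millions of dollars.

```python
gaap_net_income_2025 = 237.8
one_time_noncash_gain = 363.3       # forward contract liability gain
stock_based_compensation = 49.8     # added back under the assumed convention

non_gaap_2025 = gaap_net_income_2025 - one_time_noncash_gain + stock_based_compensation
print(round(non_gaap_2025, 1))  # -75.7

# How much the Non-GAAP loss widened relative to 2024.
non_gaap_loss_2024 = 21.8
widening_pct = (abs(non_gaap_2025) - non_gaap_loss_2024) / non_gaap_loss_2024 * 100
print(round(widening_pct))  # 247
```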

11.3 Gross Margin Structure

Business Line	2024	2025	Q1 2025	Q2 2025	Q3 2025	Q4 2025
Consolidated Gross Margin	42%	39%	-	-	-	-
Hardware Gross Margin	-	43%	-	-	-	-
Cloud Services Gross Margin	-	30%	68%	26%	16%	21%

Cloud gross margin plummeted from 68% in Q1 to 16% in Q3, reflecting severe underutilization of capacity in newly built data centers. Cerebras is shifting from a high-margin hardware sales model to a lower-margin cloud services model (the business model of the OpenAI contract). The degree of gross margin convergence in this structural shift will determine the company’s long-term earnings power.

11.4 IPO Pricing History

Date	Event	Price Range / Pricing	Notes
2025.10	Initial Confidential S-1 Filing	Terminated	Stalled due to CFIUS review
2026.05.04	Amended S-1 Public Filing	$115-$125	28 Million Shares
2026.05.10	First Upward Revision	$125-$135	Oversubscription exceeded expectations
2026.05.11	Second Upward Revision	$150-$160	30 Million Shares
2026.05.13	Final Pricing	$185	20x Oversubscribed
2026.05.14	First Day Open	$350	+89% vs. Pricing
2026.05.14	First Day Close	$311	Fully Diluted Valuation ~$48.8 Billion

The pre-IPO secondary market (Hiive) average trading price of $187.53 was within about 1.4% of the $185 IPO price, which suggests that the private market had already priced in high expectations for Cerebras as a differentiated AI processor company.

11.5 Funding History

Date	Round	Amount	Per Share	Implied Valuation
2016.5	Series A	$27 Million	-	-
2018.11	Series D	$88 Million	-	Unicorn Status
2019.11	Series E	$270 Million	-	$2.4 Billion
2021.11	Series F	$250 Million	-	Over $4.0 Billion
2024.7-9	Series F-1	$85 Million	$14.66	-
2025.9	Series G	$1.1 Billion	$36.23	$8.1 Billion
2026.1	Series H	$1.0 Billion	$89.01	-
2026.5.14	IPO	$5.55 Billion	$185	~$48.8 Billion (First-Day Close)

The jump from Series H ($89.01) to the IPO ($185) in just 4 months is a 108% increase. The rise from Series F-1 ($14.66) to the IPO ($185) is a 1,162% increase over 22 months.
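The step-ups quoted above can be reproduced from the per-share prices in the funding table. A minimal sketch; the list layout and the `step_up_pct` helper are illustrative, and only rounds with a disclosed per-share price are included:

```python
# (round name, price per share in USD) for rounds with a disclosed price.
ROUNDS = [
    ("Series F-1", 14.66),
    ("Series G", 36.23),
    ("Series H", 89.01),
    ("IPO", 185.00),
]


def step_up_pct(old_price: float, new_price: float) -> float:
    """Percentage increase in share price between two rounds."""
    return (new_price / old_price - 1.0) * 100.0


# Step-up between each pair of consecutive rounds.
for (prev_name, prev_price), (name, price) in zip(ROUNDS, ROUNDS[1:]):
    print(f"{prev_name} -> {name}: +{step_up_pct(prev_price, price):.0f}%")

# Cumulative step-up from the earliest priced round to the IPO.
first_to_ipo = step_up_pct(ROUNDS[0][1], ROUNDS[-1][1])
print(f"{ROUNDS[0][0]} -> IPO: +{first_to_ipo:,.0f}%")
```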

11.6 Key Terms of the OpenAI Agreement

Total Contract Value: Over $20 Billion (publicly stated as a $10 Billion base, expandable to 2 GW).
Compute Scale: 750 MW (Base), up to 2,000 MW (Optional).
Model: Pure cloud capacity subscription (not hardware sales).
Operating Loan: OpenAI provides Cerebras with a $1 Billion working capital loan.
Equity: OpenAI receives warrants to subscribe for Class N non-voting common stock, allowing it to hold up to 10% equity in Cerebras upon completion of certain compute deployment milestones.
Exclusivity Clause: The contract restricts Cerebras from selling products to Anthropic.

11.7 AWS Bedrock Collaboration

CS-3 systems are integrated into the AWS Bedrock managed inference service as an underlying compute engine, with Trainium handling prefill and Cerebras handling decode. This partnership opens a compliant commercial pathway for Cerebras to reach hundreds of thousands of small and medium business customers and enterprise developers, and it begins to offset the company’s largest geographic customer concentration risk (reliance on a single UAE client) with U.S.-based demand.

12. Customer Concentration, Geopolitical Risk, and Competitive Landscape

12.1 Customer Concentration Data

Customer	2024 Revenue Share	2025 Revenue Share	2025 Accounts Receivable Share	Nature
G42	85%	24%	-	Abu Dhabi Tech Holding, Strategic Investor
MBZUAI	Not Material	62%	77.9%	UAE AI University, G42 Affiliate
Total	85%	86%	-	Two Customers Represent 86% of Total Revenue

12.2 The Multifaceted Nature of the G42 Relationship

G42 is simultaneously Cerebras’s customer (made $640 Million in prepayments), supplier (compute collaboration), partner (Condor Galaxy series supercomputer co-operator), investor (participated in Series G/H + holds 3.5 million shares), and a related party (as defined by ASC 850). In 2024, G42 was granted a warrant to purchase 1,857,516 shares of Class N common stock at an exercise price of $0.01 per share.

12.3 Controversial Terms of the OpenAI Warrants

OpenAI received 33.4 million warrants at a symbolic price of $0.01 per share (virtually free) — forming an extreme contrast with the IPO price of $185 and the Series H price of $89.01. The scale of value transfer from these warrants ranges between $3 Billion and $6 Billion (depending on market price), a highly unusual arrangement historically.
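The $3 Billion to $6 Billion range can be reproduced as the intrinsic value of the warrants at the two reference prices mentioned above. This is a rough sketch: it assumes all 33.4 million warrants vest (vesting depends on deployment milestones) and ignores time value, dilution, and taxes. The function name is illustrative.

```python
WARRANT_SHARES = 33_400_000
EXERCISE_PRICE = 0.01  # USD per share

# Reference share prices quoted in this section (USD).
REFERENCE_PRICES = {
    "Series H": 89.01,
    "IPO": 185.00,
}


def intrinsic_value(share_price: float) -> float:
    """Intrinsic value in USD, assuming every warrant vests and is exercised."""
    return WARRANT_SHARES * max(share_price - EXERCISE_PRICE, 0.0)


values = {label: intrinsic_value(price) for label, price in REFERENCE_PRICES.items()}

for label, value in values.items():
    print(f"At {label} price: ${value / 1e9:.2f}B")
```

Because the exercise price is nominal, the warrants behave almost like an outright share grant: their value moves one for one with the share price.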

12.4 Competitive Landscape: Cerebras vs. Groq vs. SambaNova

Dimension	Cerebras	Groq	SambaNova
Core Idea	Wafer-Scale Integration (Giant Chip)	LPU (Language Processing Unit)	Reconfigurable Dataflow Architecture
Chip Area	46,225 mm²	Standard Single Chip	Standard Single Chip
Process Node	TSMC 5nm	GlobalFoundries 14nm (First-Gen LPU)	TSMC 5nm (SN50)
Memory Strategy	On-Chip SRAM 44 GB	SRAM (Deterministic Latency)	Hierarchical Reconfigurable
Core Selling Point	Memory Bandwidth / Dual-Use Training & Inference	Extremely Low & Deterministic Latency	Flexible Dataflow Programming
Funding/Valuation	IPO ~$48.8 Billion	~$1 Billion+	$4 Billion

12.5 NVIDIA’s Structural Advantages

Although Cerebras holds a clear upper hand in inference latency and memory bandwidth, NVIDIA’s advantages in the following dimensions are difficult to shake in the short term:

CUDA Ecosystem: 4 million+ developers, the broadest framework support, the most mature model optimization libraries.
Workload Flexibility: Same hardware used for both training and inference, supporting any model architecture.
Supply Chain Maturity: Global OEM system integrators, a spare parts market, and well-established enterprise operation and maintenance processes.
Continuous Memory Capacity Growth: HBM3e to HBM4 (2026) to 1+ TB in 2028, while SRAM density growth is stagnating.
Community Effects: All new models debut on CUDA/cuDNN; Cerebras requires labor-intensive manual adaptation per model.

13. Summary and Outlook

Cerebras’s wafer-scale chip represents a unique achievement in the history of semiconductor engineering since the invention of the microprocessor. Gene Amdahl challenged WSI in 1980 with Trilogy Systems and ended in devastating failure. Over forty years later, the same problem has been given engineering-grade solutions across five critical dimensions — defect tolerance, reticle stitching, thermal expansion compensation, vertical power delivery, direct liquid cooling — making WSI a commercial reality for the first time.

From a technical perspective, the WSE-3’s 21 PB/s on-chip SRAM bandwidth lowers the Ridge Point for the LLM decode phase to 0.6 FLOP/byte, making a batch size of 1 compute-bound — an impossibility on GPUs. This architectural characteristic arrives at a propitious time in the 2025-2026 rise of the Inference Economy: as inference surpasses training to become AI’s core computational bottleneck, Cerebras’s unique architectural advantage is unlocked.

From a business perspective, Cerebras grew revenue from $24.6 million in 2022 to $510 million in 2025 [10][13] (a roughly 20x increase over 3 years), raised $5.55 billion in its IPO, and reached a first-day valuation of $48.8 billion, a historic capital market validation of the WSI technology route. However, the Non-GAAP operating loss widened from $21.8 million to $75.7 million (+247%), cloud gross margin plunged from 68% to 16%, 86% of revenue depends on two related-party UAE customers, and OpenAI received 33.4 million warrants at a nominal exercise price of $0.01/share, which is deep in the money. These figures demand calm amid the euphoria.

Cerebras’s technology route will not replace GPUs, but it has established a clear competitive moat in a specific and increasingly important market segment: large model inference. Key observation points for the next 2-3 years:

The actual delivery cadence and gross margin for the >$20 Billion OpenAI contract.
CS-3 aggregate throughput / TCO data under high-concurrency scenarios (currently missing).
The relative race between SRAM density growth and HBM capacity expansion.
Whether the implementation of Co-Packaged Optics (CPO) technology can shrink the I/O bottleneck by 1-2 orders of magnitude from the current 133,000x gap.
The change in customer concentration after CFIUS clearance allows more U.S.-based enterprise customers to be onboarded.

14. References

[1] Cerebras Official Chip Page. https://www.cerebras.ai/chip
[2] Cerebras WSE-3 Press Release (March 2024). https://www.cerebras.ai/press-release/cerebras-announces-third-generation-wafer-scale-engine
[3] Wikipedia - Cerebras Systems. https://en.wikipedia.org/wiki/Cerebras_Systems
[4] IEEE Spectrum - Cerebras WSE-3: Third Generation Superchip for AI (March 2024). https://spectrum.ieee.org/cerebras-chip-cs3
[5] arXiv - A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems (March 2025). https://arxiv.org/html/2503.11698v1
[6] Peak FLOPS Substack - Breaking down the Cerebras Wafer Scale Engine (April 2026). https://wafer.substack.com/p/breaking-down-the-cerebras-wafer
[7] Introl Blog - Cerebras Wafer-Scale Engine: When to Choose Alternative AI Architecture (April 2026). https://introl.com/blog/cerebras-wafer-scale-engine-cs3-alternative-ai-architecture-guide-2025
[8] TechCrunch - The five technical challenges Cerebras overcame (August 2019). https://techcrunch.com/2019/08/19/the-five-technical-challenges-cerebras-overcame-in-building-the-first-trillion-transistor-chip/
[9] TechCrunch - 600 billion dollar AI chip darling Cerebras almost died early on, burning 8 million dollars a month (May 2026). https://techcrunch.com/2026/05/16/
[10] Mostly Metrics - Cerebras IPO S-1 Breakdown (April 2026). https://www.mostlymetrics.com/p/cerebras-ipo-s1-breakdown
[11] Cerebras Blog - 100x Defect Tolerance: How Cerebras Solved the Yield Problem. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
[12] Cerebras Blog - Cerebras CS-3 vs. Nvidia DGX B200 Blackwell (September 2025). https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-dgx-b200-blackwell
[13] SEC.gov - Cerebras S-1 Registration Statement (April/May 2026). https://www.sec.gov/Archives/edgar/data/2021728/000162828026025762/cerebras-sx1april2026.htm
[14] Forbes - Cerebras, Groq And SambaNova Line Up To Compete With Nvidia (October 2025). https://www.forbes.com/sites/karlfreund/2025/10/21/cerebras-groq-and-sambanova-line-up-to-compete-with-nvidia/
[15] Reuters - Cerebras shares skyrocket in debut (May 2026). https://www.reuters.com/legal/transactional/cerebras-set-debut-stock-market-gripped-by-ai-mania-2026-05-14/
[16] Sacra Research - Cerebras vs Nvidia. https://sacra.com/research/cerebras-vs-nvidia/
[17] TechCrunch - Cerebras raises 5.5 billion dollars, then stock pops 108% (May 2026). https://techcrunch.com/2026/05/14/cerebras-raises-5-5b-kicking-off-2026s-ipo-season-with-a-bang/
[18] Chip Yield Analysis Tool - Cerebras WSE-3 Wafer-Scale Yield Analysis. https://blackyabhishek.github.io/analysis/cerebras_yield_analysis.html
[19] Cerebras Blog - Supporting PyTorch on the Cerebras Wafer-Scale Engine (April 2022). https://www.cerebras.ai/blog/supporting-pytorch-on-the-cerebras-wafer-scale-engine
[20] Cell/Device Journal - Performance, efficiency, and cost analysis of wafer-scale AI (2025). https://www.cell.com/device/fulltext/S2666-9986(25)00147-4
[21] Hot Chips 2024 - Cerebras Wafer-Scale AI Presentation. https://hc2024.hotchips.org/assets/program/conference/day2/72_HC2024.Cerebras.Sean.v03.final.pdf
[22] Cerebras Blog - How Cerebras Solved the Yield Problem. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
[23] Cerebras and AWS Collaboration Press Release (March 2026). https://www.cerebras.ai/press-release/awscollaboration
