1. Overview
Groq, Inc. is an AI chip company headquartered in Mountain View, California, founded in 2016 by former Google TPU core designer Jonathan Ross [1]. The company’s processor architecture was initially called the Tensor Streaming Processor (TSP), later rebranded as the Language Processing Unit (LPU) during the large language model wave of 2023-2024 [2].
Groq’s core philosophy rests on a radical and elegant design choice: discard all non-deterministic hardware mechanisms accumulated over forty years in the computing industry, and hand execution scheduling authority entirely to the compiler [3]. In traditional CPUs and GPUs, cache hierarchies, branch prediction, out-of-order execution, and dynamic scheduling are core mechanisms for boosting average performance, but they also introduce latency unpredictability. Groq’s design team realized that for inference workloads—whose computation graphs are known and fixed at runtime—these mechanisms are not merely superfluous, but actively harmful.
This choice enabled Groq’s LPU to achieve unprecedented levels of LLM inference latency. In early 2024, Groq performed so outstandingly in independent benchmarks by ArtificialAnalysis.ai that the testers were forced to extend the chart axes just to fit Groq’s data points on the graph [4]. On Llama 2 70B, Groq achieved an output speed of approximately 300 token/s, roughly 10x faster than a comparable NVIDIA H100 cluster [5].
On December 24, 2025, NVIDIA reached a technology licensing and talent acquisition agreement with Groq for approximately $20 billion [6][7]. In March 2026, at GTC, NVIDIA unveiled the first chip born from this collaboration: the Groq 3 LPU (LP30)—built on Samsung 4nm process, integrating 512 MB of on-chip SRAM, delivering 150 TB/s of memory bandwidth—serving as a dedicated decode-stage co-processor within the Vera Rubin platform, with shipments planned for Q3 2026 [8].
Key Timeline:
- 2016: Company founded, Jonathan Ross leaves Google
- 2017: $10M seed round from Social Capital
- 2021: $300M Series C (Tiger Global, D1 Capital), valuation exceeds $1B
- 2022: Acquisition of Maxeler Technologies (dataflow computing)
- 2023: Samsung 4nm production line selected, TSP rebranded as LPU
- 2024.02: GroqCloud developer platform launched
- 2024.03: Acquisition of Definitive Intelligence (Sunny Madra joins)
- 2024.08: $640M Series D (led by BlackRock), valuation $2.8B
- 2025.02: $1.5B infrastructure commitment from Saudi Arabia
- 2025.07: Revenue forecast slashed from $2B to $500M
- 2025.09: Valuation rises to $6.9B
- 2025.12: NVIDIA acquires Groq technology + talent for $20B
- 2026.03: Groq 3 LPU (LP30) unveiled at GTC
- 2026.05: Groq independent entity raises $650M, pivots to AI inference cloud
2. Historical Evolution and Founding Background
2.1 Founder and TPU
Groq’s founder, Jonathan Ross, was one of the core designers of Google’s TPU [9]. The TPU was a custom ASIC developed by Google around 2015 for internal inference workloads. Ross amassed deep experience in AI accelerator design during the TPU project—particularly the approach of deriving hardware design from domain-specific workloads. In 2016, Ross co-founded Groq with another former Google engineer, Douglas Wightman [1]. Wightman departed Groq in 2019 [10]; Ross subsequently served as CEO until the completion of the NVIDIA deal.
2.2 Funding History
| Round | Date | Amount | Key Investors | Valuation |
|---|---|---|---|---|
| Seed | 2017 | $10M | Social Capital (Chamath Palihapitiya) [11] | — |
| Series A/B | 2018 | $52M | Social Capital et al. [11] | — |
| Series C | 2021.04 | $300M | Tiger Global, D1 Capital [12] | >$1B |
| Series D | 2024.08 | $640M | BlackRock PE, Cisco, Samsung Catalyst [13] | $2.8B |
| Series D+ | 2024.09 | $750M | Disruptive, BlackRock, Neuberger Berman [14] | ~$6.9B |
| Saudi Commitment | 2025.02 | $1.5B (infrastructure) | Kingdom of Saudi Arabia [15] | — |
| Bridge Financing | 2026.05 | $650M | Disruptive, Infinitum (pro-rata) [16] | Not disclosed |
Social Capital’s Chamath Palihapitiya entered Groq with a $10M seed round in 2017—a time when Silicon Valley chip startups were considered “venture capital poison” [11]. By the time of the NVIDIA deal in 2025, this investment had multiplied into billions of dollars.
2.3 Key Acquisitions
Groq made two significant acquisitions in its history:
Maxeler Technologies (March 2022) [17]: Acquired this London-based dataflow computing company, founded by Dr. Oskar Mencer in 2003. Maxeler’s team of approximately 20 joined Groq’s London office, bringing deep expertise in FPGA dataflow systems and high-performance computing. This acquisition provided critical talent for Groq’s multi-chip scaling network design.
Definitive Intelligence (March 2024) [18]: This acquisition directly spawned the GroqCloud business unit. Definitive Intelligence’s co-founder and CEO Sunny Madra joined Groq to lead GroqCloud—he had previously founded Autonomic (acquired by Ford in 2018). Madra later became President of Groq and joined NVIDIA following the deal [7].
2.4 Early Strategy: The Unexpected Pivot from CNN to LLM
The TSP was not originally designed for large language models. Its 2020 ISCA paper primarily targeted convolutional neural networks and traditional deep learning inference [3]. Following the explosion of ChatGPT in late 2022, Groq quickly recognized its architecture’s unique advantages for transformer-based LLMs—particularly the bandwidth-sensitive, latency-deterministic nature of autoregressive decoding. During 2023-2024, the company rebranded the TSP as the Language Processing Unit (LPU) [2], shifting its market positioning from “general-purpose AI accelerator” to “LLM inference-dedicated engine.”
3. In-Depth Technical Architecture Analysis
3.1 Design Philosophy: Determinism First
The core characteristic of traditional CPU and GPU microarchitectures is non-deterministic execution. A program run twice on the same input may produce different exact instruction timings each time. The sources of this non-determinism include:
- Cache Hierarchy: The latency difference between a cache hit (~10 cycles) and a cache miss (~200 cycles) can be 20x
- Branch Prediction: Pipeline flush and rollback on misprediction wastes 10-20 cycles
- Out-of-Order Execution: Hardware dynamically reorders instructions in unpredictable sequences
- Dynamic Scheduling: Arbiters and reorder buffers make autonomous decisions at runtime
Groq’s core insight is: the inference workload has no control flow uncertainty at runtime—the model’s computation graph is a directed acyclic graph (DAG) known at compile time. Therefore, all scheduling decisions can and should be made at compile time, rather than having the hardware guess [3].
This choice produces the following design consequences:
- No Caches: On-chip SRAM serves as primary weight storage, not a cache. All data access latency is known and constant
- No Branch Prediction: The compiler already knows all computational paths
- No Out-of-Order Execution: Instruction order is fixed by the compiler at compile time
- Static Scheduling: The compiler precisely calculates the timing of every instruction’s issue, execution, and completion
3.2 TSP Functionally-Sliced Microarchitecture
The TSP’s core architecture upends the traditional multi-core tiled design. In conventional chips, each tile is a complete processor core containing a variety of functional units. The TSP arranges functional units by type in a 2D grid—each vertical column (slice) contains one type of functional unit, termed a functionally-sliced microarchitecture [3].
flowchart TD
subgraph "TSP Chip - Functionally-Sliced Layout"
direction TB
subgraph "Four Functional Slice Columns (20 tiles per column, 16 SIMD lanes per tile = 320 lanes/column)"
MEM["MEM (Memory Read/Write)"]
VXM["VXM (Vector ALU)"]
MXM["MXM (Matrix Multiply)"]
SXM["SXM (Shift/Rotate)"]
end
ICU["ICU (Instruction Control Unit) — arranged horizontally, 144 instruction queues"]
end
Specific responsibilities of each functional slice [3]:
- MXM (Matrix Execution Module): Executes 320 x 320 fused dot product matrix multiplications—the core hardware for GEMM operations
- VXM (Vector Execution Module): Executes element-wise add, multiply, and activation functions
- SXM (Shift Execution Module): Vector shift and rotate operations for data format reorganization
- MEM (Memory Module): Manages read/write operations for 220 MB of globally shared SRAM
- ICU (Instruction Control Unit): Arranged horizontally, containing 144 independent instruction queues, capable of issuing multiple instructions per cycle
The fundamental difference between TSP and GPU in design is: a GPU’s SMs (Streaming Multiprocessors) are highly autonomous internally, each with independent schedulers; whereas the TSP’s ICU is distributed across the top of all slices, with instructions flowing from a centralized compiler-scheduled table to each slice—data passes between slices in a producer-consumer stream fashion. The compiler precisely schedules when each data element is written to SRAM, when it is read by which tile, and where the processed stream flows to next.
3.3 Streaming Execution Model
The core of the execution model is the vector stream. Vectors read from SRAM are assigned a stream ID (0-31) and a direction (East/West), passing between functional slices in a pipeline fashion. The execution of each instruction is interleaved in time—the ICU issues instruction A to the bottom tile at t1; at t2, that tile’s 16 result vectors propagate northward to the next tile, while the ICU issues instruction B to process the next 16-element block. This resembles an assembly line, where the rhythm of movement at all stations is pre-orchestrated by the compiler [3].
Key advantage brought by determinism: The compiler knows the exact latency of every instruction (because the hardware has no uncertainty), enabling it to solve a two-dimensional scheduling problem at compile time—precisely arranging every instruction and every data element in both time (when to issue) and space (which tile).
3.4 Compiler and ISA
The TSP’s compiler possesses complete control over the hardware:
| Architectural State | Quantity | Compiler Control Method |
|---|---|---|
| SIMD Lanes | 320 lanes | Compiler assigns workloads to 20 tiles x 16 lanes |
| Instruction Queues | 144 | Compiler controls program order per queue; HW has no OOO [3] |
| Logical Streams | 64 per lane (32 E + 32 W) | Compiler determines data direction and timing |
| Global SRAM | 220 MB | Compiler manages as primary storage |
The core difference from GPU programming: GPU developers must manually optimize CUDA kernels to handle cache behavior and thread scheduling uncertainty; Groq’s compiler automates all of this, with completely deterministic results [20].
3.5 TruePoint Numerical Precision
The LPU adopts a TruePoint mixed-precision strategy [21]:
- Storage: Weights stored in INT8 or FP8 to maximize SRAM utilization
- Computation: Internally uses 320-element fused dot product with high precision (FP32) to execute sensitive operations like attention logits
- Deterministic Rounding: Since computation order is fixed at compile time, rounding errors are completely predictable—in contrast to GPUs, where the same model may produce different floating-point rounding results on each inference run [21][22]
In a blog post published in December 2025, SambaNova claimed that Groq’s low-precision inference exhibited statistically significant accuracy degradation compared to FP32 baselines on certain tasks [22]. Groq’s counterarguments include: tests at Argonne National Laboratory demonstrated that TruePoint achieved 185x throughput on SARS-CoV-2 drug discovery workloads while maintaining FP32-level result accuracy [21]. At present, comprehensive independent third-party verification of precision remains limited.
3.6 Multi-Chip Scaling: Software-Defined Tensor Streaming Multiprocessor
A single LPU chip’s 230 MB SRAM is far from sufficient to accommodate large models—Llama 3.1 70B in FP8 requires approximately 70 GB, necessitating roughly 140 LPU v1 chips in parallel. Groq’s second ISCA paper (2022) described a large-scale TSP network scaling solution [23]:
- Topology: 2D torus network, with the compiler pre-scheduling inter-chip data flow
- Routing: Deterministic routing, without conventional routers and arbitration
- Flow Control: Compiler-managed producer-consumer model
- Theoretical Scaling Limit: 10,440 TSPs, end-to-end system latency <3 µs [23]
4. Generational Evolution and Specification Comparison
4.1 Complete Generational Specification Table
| Parameter | LPU v1 (TSP/GroqChip 1) | LPU v2 (4nm Transition) | Groq 3 LP30 (NVIDIA) |
|---|---|---|---|
| Process Node | GlobalFoundries 14nm [24] | Samsung 4nm [25] | Samsung SF4X [8] |
| Die Area | 25 x 29 mm (725 mm²) [3] | Not disclosed | Not disclosed |
| Frequency | 900 MHz [3] | Not disclosed | Not disclosed |
| Compute Density | >1 TOPS/mm² [3] | — | — |
| On-chip SRAM | 230 MB [26] | ~300-400 MB (est.) | 512 MB [8] |
| SRAM Bandwidth | 80 TB/s [26] | Not disclosed | 150 TB/s [8] |
| External Memory | No HBM | No HBM | No HBM |
| INT8 Compute | 750 TOPS [27] | — | — |
| FP16 Compute | 188 TFLOPS [27] | — | — |
| FP8 Compute | — | — | 1.2 PFLOPS [28] |
| Vector ALUs | 5,120 [27] | — | — |
| Matrix Multiply | 320x320 fused dot [3] | — | Enhanced version |
| TDP | ~300W [29] | — | — |
| Determinism | Full [3] | Full | Full |
| Status | Mass production (2020-2024) | Transition | Q3 2026 shipment |
Key generational evolution figures: SRAM capacity increased from 230 MB to 512 MB (2.2x); bandwidth rose from 80 TB/s to 150 TB/s (1.9x). While the absolute increase is modest, this is achieved against the backdrop of SRAM density being unable to scale as rapidly as DRAM—an SRAM bit cell requires 6 transistors, whereas DRAM requires only 1 transistor plus a capacitor—making a 2x capacity improvement per generation non-trivial.
4.2 Groq 3 LPX System Specifications
The core value proposition of Groq 3 lies in inference disaggregation—splitting the prefill (compute-intensive) and decode (bandwidth-intensive) phases of inference across different hardware.
flowchart LR
USER["User Query"] --> P["Vera Rubin NVL72
72 x Rubin GPU
Prefill Phase
288 GB HBM4, 22 TB/s"]
P -->|"Dynamo Orchestration Layer
Prefill → Decode Separation"| D["Groq 3 LPX Rack
256 x LP30
Decode Phase
128 GB SRAM, 40 PB/s"]
D --> R["Low-Latency Token Output"]
| LPX Rack Specification | Value |
|---|---|
| Number of LP30 Chips | 256 (32 x 1U compute tray) [28] |
| Total On-chip SRAM | 128 GB [28] |
| Aggregate SRAM Bandwidth | 40 PB/s [28] |
| Total Compute (FP8) | 315 PFLOPS [28] |
| chip-to-chip scaling bandwidth | 640 TB/s [28] |
NVIDIA claims that LPX + Vera Rubin NVL72 delivers 35x higher throughput-per-megawatt than Blackwell NVL72 on trillion-parameter models, targeting a token price of $45 per million tokens [8].
Subsequent chips in NVIDIA’s roadmap: LP35 (adds NVFP4 support, aligning with Rubin Ultra), LP40 (planned for Feynman architecture) [8].
4.3 Architectural Comparison with NVIDIA GPU
| Comparison Dimension | Groq LP30 | NVIDIA Rubin GPU |
|---|---|---|
| On-chip Storage | 512 MB SRAM | ~50 MB L2 Cache |
| Storage Speed | 150 TB/s (on-chip) | 22 TB/s (HBM4 off-chip) |
| Storage Capacity | 512 MB per chip | 288 GB HBM4 |
| Latency Consistency | Fully deterministic (no cache misses) | Cache hierarchy non-deterministic |
| Applicable Phase | Decode-dedicated | Prefill + Decode general-purpose |
| Compiler | Static scheduling, zero runtime overhead | CUDA kernel dynamic scheduling |
5. Performance Benchmarks and Energy Efficiency Analysis
5.1 Inference Latency and Throughput
Performance data for Groq LPU across various open-source models:
| Model | Groq LPU | GPU Comparison | GPU Platform | Speedup | Source |
|---|---|---|---|---|---|
| Llama 2 70B | ~300 tok/s | ~30 tok/s | H100 cluster | ~10x | [5] |
| Llama 3 70B | 500-750 tok/s | 10-40 tok/s | H100/H200 | ~15-50x | [30] |
| Gemma 7B | ~814 tok/s | ~100 tok/s | GPU | ~8x | [32] |
| Mistral Large | ~320 tok/s | ~28 tok/s | A100 | ~11x | [33] |
| Mixtral 8x7B | ~500 tok/s | ~40 tok/s | H100 | ~12x | [34] |
| Phi-3 | 3,200 tok/s | ~600 tok/s | H100 + vLLM | ~5x | [35] |
| Llama 3 8B | ~500-600 tok/s | ~80 tok/s | H100 | ~7x | [34] |
It is important to note that these data points originate from multiple sources and varying test conditions, and do not represent A/B testing under a unified benchmark. However, the overall trend is consistent: in single-user/low-batch (batch=1) scenarios, Groq LPU’s speed advantage is most pronounced (10-50x). As batch size increases, GPU utilization rises, narrowing the gap.
5.2 Latency Determinism
A key, often underestimated advantage of Groq is extremely low latency variance [4]:
- Time To First Token (TTFT) ~0.22s, largely unaffected by system load
- Latency variance per inference run for the same model under the same configuration <5%
- GPU system latency variance under identical conditions can reach 30-50%, primarily due to HBM refresh cycles and cache contention
This characteristic is crucial for real-time interactive AI applications (voice assistants, Agentic AI).
5.3 Energy Efficiency
| Metric | Groq LPU | GPU (H100) | Ratio |
|---|---|---|---|
| Joules/token | 1-3 J [36] | 10-30 J [36] | ~10x |
| Energy/1M tokens | 1-3 kWh | 10-30 kWh | ~10x |
| Single card power | ~300W | 700W | ~0.4x |
| Single card price | ~$20,000 [37] | ~$28,000-38,000 [37] | ~0.5-0.7x |
Groq LPU’s energy efficiency advantage has three physical origins: (1) SRAM read energy is approximately 0.1-0.3 pJ/bit, whereas HBM reads (including TSV + SerDes) are about 5 pJ/bit, a 17-50x difference; (2) the deterministic architecture eliminates waste from speculative execution and cache misses; (3) lower overall TDP.
6. GroqCloud Platform and Pricing Model
GroqCloud launched in February 2024, offering an OpenAI-compatible API [18]. As of 2026, three deployment tiers are available:
- Public Cloud (GroqCloud): Token-based billing, with Free/Developer/Enterprise rate tiers
- Private/Dedicated Cloud: Custom capacity and key data residency
- GroqRack On-Premise: For government, finance, and other regulated industries, air or liquid cooling [38]
6.1 Supported Models and Pricing
Groq supports only open-weight models; it does not host proprietary models such as GPT-5.5, Claude, or Gemini [39]:
| Model | Input Price | Output Price |
|---|---|---|
| Llama 3.1 8B | $0.05 / M tokens | $0.08 / M tokens |
| Llama 3.3 70B | $0.59 / M tokens | $0.79 / M tokens |
| DeepSeek R1 Distill 70B | $0.75 / M | $0.99 / M |
| Whisper Turbo (Speech) | $0.04 / hour | — |
| GPT-OSS, Qwen3 32B, Kimi K2 | Varies by model | — |
Free tier rate limits: 30 RPM, 1K RPD for all models; Llama 70B: 12K TPM, 100K TPD [40]. Batch API offers a 50% discount (24h-7d processing window) [39].
7. Business Model and Financial Analysis
7.1 Revenue Forecast Evolution
Groq’s financial trajectory presents a case study of an AI hardware startup navigating the gap between “revenue illusion and reality”:
- Early 2025: Forecast annual revenue of over $2B to investors [41]
- July 2025: Sharply revised down to approximately $500M—a 75% cut within three months [41][42]
Primary reasons for the downgrade [43]: (1) Insufficient data center capacity—physical deployment speed of LPU clusters fell behind expectations; (2) Partial revenue from the Saudi agreement deferred to 2026 recognition; (3) Enterprise customer signing pace below forecasts; (4) Chip production yield and delivery cycle challenges.
7.2 Key Financial Metrics
| Metric | Value | Conditions/Source |
|---|---|---|
| 2025 Revenue Forecast | ~$500M | Post-downgrade [41] |
| 2023 Net Loss | -$88M | Public data [1] |
| 2024 ARR | ~$172M | Latka estimate [45] |
| GroqCloud Developers | ~2 million | Company disclosure [46] |
| Key Customers | Bell Canada, Aramco Digital, Saudi Arabia [47] | Enterprise contracts |
| Customer Concentration Risk | Highly dependent on 1-2 Middle Eastern entities | Saudi Arabia accounts for majority of agreement [48] |
7.3 Cumulative Funding
Groq raised approximately $1.87B in equity financing from 2017 to 2026, plus a $1.5B infrastructure commitment from Saudi Arabia, totaling approximately $3.37B. Adding the $20B NVIDIA deal consideration, the total valuation of Groq’s technology + talent + assets approximates $23B—though founders and investors received substantial returns, the commercial reality of the company’s independent operational side remained stark [44].
7.4 Saudi Arabia Agreement
The $1.5B commitment was the most critical non-equity funding source during Groq’s independent period [15]. The core of this agreement:
- Infrastructure: Build a GroqCloud data center in Dammam—the largest AI inference hub in the EMEA region
- Partner: Aramco Digital, providing inference capacity for its Norous voice AI and Allam bilingual model
- Strategic Alignment: Aligns with Saudi Vision 2030’s AI economic diversification strategy
This also introduced significant risk: Groq’s lion’s share of revenue and Middle East expansion plans were highly dependent on sustained investment from Saudi Arabia [48].
8. Competitive Landscape Analysis
8.1 Inference Chip Panorama
The AI inference chip market in 2026 has formed a “three-way split” structure:
- GPU General-Purpose: NVIDIA H100/B200, AMD MI300X—flexible, CUDA ecosystem, but low inference efficiency
- ASIC Inference-Dedicated: Groq LPU, Cerebras WSE-3, SambaNova SN40L, Etched Sohu, MatX—high $ inference performance, 10-100x vs GPU
- Hyperscaler Custom: Google TPU v7 (Ironwood, 4,614 TFLOPS/chip), AWS Inferentia, Meta MTIA (four-generation roadmap), Microsoft Maia—vertically integrated, locked-in workloads
8.2 Core Competitor Comparison
| Dimension | Groq LPU | Cerebras WSE-3 | NVIDIA H100/B200 | SambaNova SN40L |
|---|---|---|---|---|
| Chip Form Factor | Single die ASIC | Wafer-scale (46,225 mm²) | Single die GPU | Multi-die reconfigurable |
| On-chip SRAM | 230/512 MB | 44 GB | ~50 MB (L2) | Not disclosed |
| Memory Bandwidth | 80/150 TB/s | 21 PB/s | 3.35 TB/s | Not disclosed |
| Supports Training | ❌ | ✅ | ✅ | ✅ |
| Determinism | Full | Wafer-scale | No | Partial |
| FP8 Compute | 1.2 PFLOPS (v3) | 125 PFLOPS | 4.5 PFLOPS (B200) | — |
| Max Model on Single Chip | ~1-7B (FP8) | ~100B+ | ~70B | — |
| Compiler Model | Static scheduling | Wafer mapping | CUDA kernel | Dataflow mapping |
| Business Model | Inference API + Cloud | Training+Inference Cloud+On-prem | Full stack | Training+Inference |
8.3 Cerebras: The Most Direct Comparison
Cerebras WSE-3 shares an SRAM-centric design philosophy with Groq, but differs significantly in scale and capability:
- Cerebras Advantages: 44 GB on-chip SRAM vs 230/512 MB—single chip can accommodate 100B+ parameter models without cross-chip data movement; supports both training and inference; customer G42, $20B chip procurement agreement with OpenAI [50]
- Groq Advantages: More mature software stack (OpenAI-compatible API), larger developer community (2M vs Cerebras ~100K), lower Time To First Token (TTFT 0.22s vs ~0.5s)
8.4 Industry Consolidation Wave
A series of acquisitions during 2025-2026 indicates that independent inference chip startups face a binary choice of “be acquired vs. be marginalized” [8]:
- Groq → NVIDIA ($20B, 2025.12)
- Untether AI → AMD acquires engineering team (2025.06)
- Enfabrica → NVIDIA acquires (~$900M, 2025.09)
- Rivos → Meta acquires (2025.10)
- SambaNova → Intel acquisition talks fail; instead, $350M investment + partnership (2026.02)
- Cerebras → Remains independent, Nasdaq IPO May 2026, valuation $95B
9. NVIDIA Acquisition
9.1 Deal Structure
Agreement reached on December 24, 2025 [6][7]:
| Element | Detail |
|---|---|
| Consideration | Approximately $20B (all cash) |
| Nature | Non-exclusive technology license + talent acquisition [6] |
| NVIDIA Acquires | All Groq patents + software stack + founding team + core engineering |
| Joining NVIDIA | Jonathan Ross (CEO), Sunny Madra (President), architecture/compiler/systems teams |
| Groq Independent Entity | Continues operations, retains LPU production and GroqCloud. CEO: Simon Edwards (ex-CFO) → Adam Winter |
| Largest NVIDIA Deal | Previous record: Mellanox ~$6.9B (2019) |
9.2 Five-Layer Strategic Analysis
1. Technology Layer: GPUs are inherently inefficient at autoregressive decode. Each token generation requires reading the full set of weights from HBM (approximately 140 GB for a 70B model); the GPU’s compute-to-bandwidth ratio is severely imbalanced in this scenario. Groq LPU’s 150 TB/s SRAM bandwidth raises this bottleneck 7x from 22 TB/s (HBM4), with fully predictable latency.
2. Product Layer: NVIDIA had originally planned to launch Rubin CPX—an inference accelerator based on GDDR7. However, CPX’s assumed ~2 TB/s GDDR7 bandwidth was utterly uncompetitive against Groq LPU’s 150 TB/s SRAM. Following GTC 2026, CPX has disappeared from NVIDIA’s roadmap [8].
3. Competitive Layer: Had Groq remained independent and been acquired by the Cerebras/OpenAI camp, it would have formed a second inference hardware ecosystem outside of GPU—posing a serious threat to NVIDIA’s long-term market dominance.
4. Platform Layer: The InfiniBand/SmartNIC technology from the 2019 Mellanox acquisition ($6.9B) was later integrated into NVLink and NVSwitch, becoming an inseparable part of NVIDIA’s AI infrastructure. Groq’s deterministic compiler and LPU architecture will follow the same path.
5. Financial Layer: Acquiring proven technology + team for 10% of NVIDIA’s ~$200B annual revenue is efficient in both time and cost compared to internal development (3-5 years + tens of billions in R&D investment).
9.3 Market Reaction
Bernstein analyst Stacy Rasgon noted: the deal’s non-exclusive licensing structure (Groq can license to companies other than NVIDIA) “may keep the fiction of competition alive, while effectively neutralizing a competitor” [52]. This structure also made the deal easier to navigate antitrust review.
AWS has already announced deployment of Groq 3 LPU + over one million NVIDIA GPUs as an extended collaboration at GTC 2026 [8].
10. Limitations, Controversies, and Unresolved Issues
10.1 SRAM Capacity Ceiling
SRAM’s physical density is approximately 1/100 to 1/200 that of DRAM—because an SRAM bit cell typically requires 6 transistors, while DRAM requires only 1 + 1 capacitor. Even with Groq 3 doubling SRAM to 512 MB, there remains a 160-560x capacity gap compared to GPU’s 80-288 GB HBM.
Practical consequences:
- Llama 3.1 70B (FP8) requires approximately 60 LP30 chips in parallel
- Trillion-parameter models: LPX rack’s 128 GB SRAM can only hold roughly 1/10 of the model, requiring external DDR5 supplementation
- Inter-chip communication overhead grows non-linearly with chip count
10.2 No Training Support
The LPU’s determinism and static scheduling make it unusable for model training [53]. Training requires: (1) dynamic backpropagation; (2) gradient accumulation; (3) iterative weight updates—all of which the LPU’s deterministic architecture cannot process. Cerebras WSE-3 holds a significant advantage here: single-chip support for training + inference.
10.3 TruePoint Precision Controversy
In a December 2025 technical blog post, SambaNova claimed that Groq’s low-precision inference exhibited statistically significant accuracy degradation across multiple NLU tasks [22]. Independent third-party verification remains insufficient.
10.4 Ecosystem Limitations
- Open-weight models only: Cannot run GPT, Claude, Gemini [39]
- Shallow model selection: ~10 hosted models vs AWS’s hundreds
- No LoRA/adapter support: Cannot fine-tune or customize
- Immature toolchain: Relative to CUDA ecosystem’s 15 years of accumulation
10.5 Independent Entity
Following the NVIDIA deal, the Groq independent entity (raised $650M in May 2026) faces contradictions:
- Lost core IP exclusivity: Licensed to NVIDIA
- Lost founding team: Ross + Madra + core engineering at NVIDIA
- Faces intense competition: Cerebras, Fireworks, Together, OpenRouter all vying for the inference cloud market
- But retains key assets: 2 million developers, GroqCloud API, ability to license to other parties if necessary
11. Summary and Future Outlook
11.1 Core Contributions
Groq’s LPU has left three distributed contributions in AI hardware history:
- Deterministic Architecture as a Third Path: Beyond GPU (flexible, non-deterministic) and fixed-function ASIC (non-programmable), proved that “programmable + fully deterministic” is a viable path
- SRAM-centric Inference Paradigm: Demonstrated the feasibility of on-chip SRAM as primary weight storage (rather than cache), providing a reference for subsequent SRAM-based inference designs
- Compiler-Hardware Co-design: Elevated the compiler to a first-class citizen in hardware scheduling, with the compiler possessing complete control over architectural state—this philosophy has influenced Etched, MatX, D-Matrix, and other companies
11.2 Key Insights
Groq’s story—from obscurity in 2016, to viral fame in 2024, to NVIDIA’s $20B acquisition at the end of 2025—reveals several trends in the AI inference hardware market:
- The optimal exit for independent inference chip startups is increasingly becoming acquisition by a major platform
- Inference disaggregation (prefill + decode assigned to different hardware) will become an industry standard
- The physical limits of SRAM capacity mean SRAM-centric designs are suited for specific market segments (ultra-low-latency inference), not as a universal replacement
11.3 Outlook (2026-2030)
- 2026 Q3: Groq 3 LPX ships, inference disaggregation formally enters productization
- 2027: LP35 + NVFP4, Rubin Ultra compatibility; AI inference market size approaches $200B [49]; hyperscaler custom + ASIC + GPU three-way split
- 2028+: LP40 with Feynman architecture; inference ASIC market share projected to grow from ~5% in 2025 to ~20-30% by 2030
References
- Groq - Wikipedia. https://en.wikipedia.org/wiki/Groq
- Williams, W. (Feb 2024). “Groq’s ultrafast LPU could well be the first LLM-native processor”. TechRadar Pro.
- Abts, D.; Ross, J.; et al. (May 2020). “Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads”. ISCA 2020. doi:10.1109/ISCA45697.2020.00023
- Groq (Feb 2024). “Groq LPU Inference Engine Crushes First Public LLM Benchmark”. groq.com/blog.
- Introl Blog (Jan 2026). “Groq LPU Infrastructure: Ultra-Low Latency AI Inference Guide 2025”.
- Nellis, S. (Dec 2025). “Nvidia, joining Big Tech deal spree, to license Groq technology, hire executives”. Reuters.
- Silberling, A. (Dec 2025). “Nvidia to license AI chip challenger Groq’s tech and hire its CEO”. TechCrunch.
- James, L. (Mar 2026). “How Nvidia’s $20 billion Groq 3 LPU deal reshapes the Nvidia Vera Rubin Platform”. Tom’s Hardware.
- LinkedIn. “Jonathan Ross — Chief Software Architect @ Nvidia”. linkedin.com/in/ross-jonathan.
- MSN. “Jonathan Ross net worth after $20 billion Nvidia deal”.
- Clark, K. (Sep 2018). “Secretive semiconductor startup Groq raises $52M from Social Capital”. TechCrunch.
- King, I. (Apr 2021). “Tiger Global, D1 Lead $300 Million Round in AI Chip Startup Groq”. Bloomberg.
- Wiggers, K. (Aug 2024). “AI chip startup Groq lands $640M to challenge Nvidia”. TechCrunch.
- Reuters (Sep 2025). “Groq more than doubles valuation to $6.9 billion”.
- Groq (Feb 2025). “Saudi Arabia Announces $1.5 Billion Expansion to Fuel AI-powered Economy with Groq”.
- Silberling, A. (May 2026). “After Nvidia’s $20B not-acqui-hire, AI chip startup Groq reportedly raising $650M”. TechCrunch.
- Groq (Mar 2022). “Groq Acquires Dataflow Systems Pioneer Maxeler Technologies”. PRNewswire.
- TechCrunch (Mar 2024). “AI chip startup Groq forms new business unit, acquires Definitive Intelligence”.
- Williams, W. (Feb 2024). “‘Feels like magic!’ — Groq’s ultrafast LPU”. TechRadar Pro.
- Upadhyay, A. (Mar 2024). “The Architecture of Groq’s LPU”. Coding Confessions Blog.
- Groq. “TruePoint Technology — Stop Compromising Accuracy for Performance”. GroqDocs.
- SambaNova (Dec 2025). “Does reduced precision hurt? A bit about losing bits”. sambanova.ai/blog.
- Abts, D.; Kimmell, G.; et al. (Jun 2022). “A software-defined tensor streaming multiprocessor for large-scale ML”. ISCA 2022. doi:10.1145/3470496.3527405
- Ward-Foxton, S. (Jan 2020). “Groq’s AI Chip Debuts in the Cloud”. EETimes.
- Hwang, J-S. (Aug 2023). “Samsung’s new US chip fab wins first foundry order from Groq”. Korea Economic Daily.
- Groq (Aug 2025). “Inside the LPU: Deconstructing Groq’s Speed”. groq.com/blog.
- GroqChip Processor Product Brief v1.7 (PDF). groq.sa.
- The Register (Mar 2026). “A closer look at Nvidia’s Groq-powered LPX rack systems”.
- Awesome Agents (Mar 2026). “Groq LPU — Deterministic Inference at Scale”.
- Silicon Analysts (Dec 2024). “Nvidia vs Groq: The Inference Acceleration Battle”.
- Markaicode (Jun 2026). “Groq Mixtral H100 Throughput: 480 tok/s on Llama 3 70B”.
- Groq (Mar 2024). “Groundbreaking Gemma 7B Performance running on the Groq LPU Inference Engine”.
- Markaicode. “Mistral Large on A100 vs Groq LPU: VRAM Benchmark”.
- Beebom. “Meet Groq, a Lightning Fast AI Accelerator that Beats ChatGPT & Gemini”.
- Markaicode. “Groq vs vLLM on H100: Phi-3 Throughput Hits 3,200 Tokens/Sec”.
- Li, Z. “Groq’s Deterministic Architecture is Rewriting the Physics of AI Inference”. Medium.
- Silicon Analysts. “Nvidia vs Groq: Cost Analysis Hardware Pricing”.
- Groq. “GroqRack Compute Cluster”. groq.com/groqrack.
- Groq On-Demand Pricing. https://groq.com/pricing
- TokenMix Blog (Apr 2026). “Groq API Pricing 2026: Free Tier, $0.05/M Paid Models”.
- The Information (Jul 2025). “Groq slashes 2025 revenue projections to $500 million”.
- Investing.com (Jul 2025). “Groq slashes 2025 revenue projections to $500 million”.
- TrendForce (Jul 2025). “Groq Cuts 2025 Revenue Projection by USD 1.5B”.
- Sacra. “Groq revenue, valuation & funding”. sacra.com/c/groq/.
- Latka. “Groq Revenue 2025: $172.5M ARR”. getlatka.com.
- Sacra (Feb 2026). “Equity Research Groq” (PDF).
- Data Center Dynamics (Feb 2025). “Groq secures $1.5bn from Saudi Arabia”.
- Reuters (Jul 2025). “AI chip startup Groq discusses $6 billion valuation”.
- Silicon Analysts (Apr 2026). “AMD vs NVIDIA AI GPU Market Share 2026”.
- Digitimes (Apr 2026). “Nvidia and OpenAI both make US$20 billion bets on AI chip startups”.
- TrendForce (2026). “Custom ASIC shipments from cloud providers growing 44.6% in 2026”.
- CNBC (Dec 2025). “Nvidia-Groq deal is structured to keep ‘fiction of competition alive,’ analyst says”.
- Cryptonomist (Apr 2026). “NVIDIA pairs Rubin GPUs with Groq LPU to cut latency, boost inference”.