Groq and LPU

1. Overview

Groq, Inc. is an AI chip company headquartered in Mountain View, California, founded in 2016 by former Google TPU core designer Jonathan Ross [1]. The company’s processor architecture was initially called the Tensor Streaming Processor (TSP), later rebranded as the Language Processing Unit (LPU) during the large language model wave of 2023-2024 [2].

Groq’s core philosophy rests on a radical and elegant design choice: discard all non-deterministic hardware mechanisms accumulated over forty years in the computing industry, and hand execution scheduling authority entirely to the compiler [3]. In traditional CPUs and GPUs, cache hierarchies, branch prediction, out-of-order execution, and dynamic scheduling are core mechanisms for boosting average performance, but they also introduce latency unpredictability. Groq’s design team realized that for inference workloads—whose computation graphs are known and fixed at runtime—these mechanisms are not merely superfluous, but actively harmful.

This choice enabled Groq’s LPU to achieve unprecedented levels of LLM inference latency. In early 2024, Groq performed so outstandingly in independent benchmarks by ArtificialAnalysis.ai that the testers were forced to extend the chart axes just to fit Groq’s data points on the graph [4]. On Llama 2 70B, Groq achieved an output speed of approximately 300 token/s, roughly 10x faster than a comparable NVIDIA H100 cluster [5].

On December 24, 2025, NVIDIA reached a technology licensing and talent acquisition agreement with Groq for approximately $20 billion [6][7]. In March 2026, at GTC, NVIDIA unveiled the first chip born from this collaboration: the Groq 3 LPU (LP30)—built on Samsung 4nm process, integrating 512 MB of on-chip SRAM, delivering 150 TB/s of memory bandwidth—serving as a dedicated decode-stage co-processor within the Vera Rubin platform, with shipments planned for Q3 2026 [8].

Key Timeline:

2016: Company founded, Jonathan Ross leaves Google
2017: $10M seed round from Social Capital
2021: $300M Series C (Tiger Global, D1 Capital), valuation exceeds $1B
2022: Acquisition of Maxeler Technologies (dataflow computing)
2023: Samsung 4nm production line selected, TSP rebranded as LPU
2024.02: GroqCloud developer platform launched
2024.03: Acquisition of Definitive Intelligence (Sunny Madra joins)
2024.08: $640M Series D (led by BlackRock), valuation $2.8B
2025.02: $1.5B infrastructure commitment from Saudi Arabia
2025.07: Revenue forecast slashed from $2B to $500M
2025.09: Valuation rises to $6.9B
2025.12: NVIDIA acquires Groq technology + talent for $20B
2026.03: Groq 3 LPU (LP30) unveiled at GTC
2026.05: Groq independent entity raises $650M, pivots to AI inference cloud

2. Historical Evolution and Founding Background

2.1 Founder and TPU

Groq’s founder, Jonathan Ross, was one of the core designers of Google’s TPU [9]. The TPU was a custom ASIC developed by Google around 2015 for internal inference workloads. Ross amassed deep experience in AI accelerator design during the TPU project—particularly the approach of deriving hardware design from domain-specific workloads. In 2016, Ross co-founded Groq with another former Google engineer, Douglas Wightman [1]. Wightman departed Groq in 2019 [10]; Ross subsequently served as CEO until the completion of the NVIDIA deal.

2.2 Funding History

Round	Date	Amount	Key Investors	Valuation
Seed	2017	$10M	Social Capital (Chamath Palihapitiya) [11]	—
Series A/B	2018	$52M	Social Capital et al. [11]	—
Series C	2021.04	$300M	Tiger Global, D1 Capital [12]	>$1B
Series D	2024.08	$640M	BlackRock PE, Cisco, Samsung Catalyst [13]	$2.8B
Series D+	2024.09	$750M	Disruptive, BlackRock, Neuberger Berman [14]	~$6.9B
Saudi Commitment	2025.02	$1.5B (infrastructure)	Kingdom of Saudi Arabia [15]	—
Bridge Financing	2026.05	$650M	Disruptive, Infinitum (pro-rata) [16]	Not disclosed

Social Capital’s Chamath Palihapitiya entered Groq with a $10M seed round in 2017—a time when Silicon Valley chip startups were considered “venture capital poison” [11]. By the time of the NVIDIA deal in 2025, this investment had multiplied into billions of dollars.

2.3 Key Acquisitions

Groq made two significant acquisitions in its history:

Maxeler Technologies (March 2022) [17]: Acquired this London-based dataflow computing company, founded by Dr. Oskar Mencer in 2003. Maxeler’s team of approximately 20 joined Groq’s London office, bringing deep expertise in FPGA dataflow systems and high-performance computing. This acquisition provided critical talent for Groq’s multi-chip scaling network design.

Definitive Intelligence (March 2024) [18]: This acquisition directly spawned the GroqCloud business unit. Definitive Intelligence’s co-founder and CEO Sunny Madra joined Groq to lead GroqCloud—he had previously founded Autonomic (acquired by Ford in 2018). Madra later became President of Groq and joined NVIDIA following the deal [7].

2.4 Early Strategy: The Unexpected Pivot from CNN to LLM

The TSP was not originally designed for large language models. Its 2020 ISCA paper primarily targeted convolutional neural networks and traditional deep learning inference [3]. Following the explosion of ChatGPT in late 2022, Groq quickly recognized its architecture’s unique advantages for transformer-based LLMs—particularly the bandwidth-sensitive, latency-deterministic nature of autoregressive decoding. During 2023-2024, the company rebranded the TSP as the Language Processing Unit (LPU) [2], shifting its market positioning from “general-purpose AI accelerator” to “LLM inference-dedicated engine.”

3. In-Depth Technical Architecture Analysis

3.1 Design Philosophy: Determinism First

The core characteristic of traditional CPU and GPU microarchitectures is non-deterministic execution. A program run twice on the same input may produce different exact instruction timings each time. The sources of this non-determinism include:

Cache Hierarchy: The latency difference between a cache hit (~10 cycles) and a cache miss (~200 cycles) can be 20x
Branch Prediction: Pipeline flush and rollback on misprediction wastes 10-20 cycles
Out-of-Order Execution: Hardware dynamically reorders instructions in unpredictable sequences
Dynamic Scheduling: Arbiters and reorder buffers make autonomous decisions at runtime

Groq’s core insight is: the inference workload has no control flow uncertainty at runtime—the model’s computation graph is a directed acyclic graph (DAG) known at compile time. Therefore, all scheduling decisions can and should be made at compile time, rather than having the hardware guess [3].

This choice produces the following design consequences:

No Caches: On-chip SRAM serves as primary weight storage, not a cache. All data access latency is known and constant
No Branch Prediction: The compiler already knows all computational paths
No Out-of-Order Execution: Instruction order is fixed by the compiler at compile time
Static Scheduling: The compiler precisely calculates the timing of every instruction’s issue, execution, and completion

3.2 TSP Functionally-Sliced Microarchitecture

The TSP’s core architecture upends the traditional multi-core tiled design. In conventional chips, each tile is a complete processor core containing a variety of functional units. The TSP arranges functional units by type in a 2D grid—each vertical column (slice) contains one type of functional unit, termed a functionally-sliced microarchitecture [3].

flowchart TD
    subgraph "TSP Chip - Functionally-Sliced Layout"
        direction TB
        
        subgraph "Four Functional Slice Columns (20 tiles per column, 16 SIMD lanes per tile = 320 lanes/column)"
            MEM["MEM (Memory Read/Write)"]
            VXM["VXM (Vector ALU)"]
            MXM["MXM (Matrix Multiply)"]
            SXM["SXM (Shift/Rotate)"]
        end
        
        ICU["ICU (Instruction Control Unit) — arranged horizontally, 144 instruction queues"]
    end

Specific responsibilities of each functional slice [3]:

MXM (Matrix Execution Module): Executes 320 x 320 fused dot product matrix multiplications—the core hardware for GEMM operations
VXM (Vector Execution Module): Executes element-wise add, multiply, and activation functions
SXM (Shift Execution Module): Vector shift and rotate operations for data format reorganization
MEM (Memory Module): Manages read/write operations for 220 MB of globally shared SRAM
ICU (Instruction Control Unit): Arranged horizontally, containing 144 independent instruction queues, capable of issuing multiple instructions per cycle

The fundamental difference between TSP and GPU in design is: a GPU’s SMs (Streaming Multiprocessors) are highly autonomous internally, each with independent schedulers; whereas the TSP’s ICU is distributed across the top of all slices, with instructions flowing from a centralized compiler-scheduled table to each slice—data passes between slices in a producer-consumer stream fashion. The compiler precisely schedules when each data element is written to SRAM, when it is read by which tile, and where the processed stream flows to next.

3.3 Streaming Execution Model

The core of the execution model is the vector stream. Vectors read from SRAM are assigned a stream ID (0-31) and a direction (East/West), passing between functional slices in a pipeline fashion. The execution of each instruction is interleaved in time—the ICU issues instruction A to the bottom tile at t1; at t2, that tile’s 16 result vectors propagate northward to the next tile, while the ICU issues instruction B to process the next 16-element block. This resembles an assembly line, where the rhythm of movement at all stations is pre-orchestrated by the compiler [3].

Key advantage brought by determinism: The compiler knows the exact latency of every instruction (because the hardware has no uncertainty), enabling it to solve a two-dimensional scheduling problem at compile time—precisely arranging every instruction and every data element in both time (when to issue) and space (which tile).

3.4 Compiler and ISA

The TSP’s compiler possesses complete control over the hardware:

Architectural State	Quantity	Compiler Control Method
SIMD Lanes	320 lanes	Compiler assigns workloads to 20 tiles x 16 lanes
Instruction Queues	144	Compiler controls program order per queue; HW has no OOO [3]
Logical Streams	64 per lane (32 E + 32 W)	Compiler determines data direction and timing
Global SRAM	220 MB	Compiler manages as primary storage

The core difference from GPU programming: GPU developers must manually optimize CUDA kernels to handle cache behavior and thread scheduling uncertainty; Groq’s compiler automates all of this, with completely deterministic results [20].

3.5 TruePoint Numerical Precision

The LPU adopts a TruePoint mixed-precision strategy [21]:

Storage: Weights stored in INT8 or FP8 to maximize SRAM utilization
Computation: Internally uses 320-element fused dot product with high precision (FP32) to execute sensitive operations like attention logits
Deterministic Rounding: Since computation order is fixed at compile time, rounding errors are completely predictable—in contrast to GPUs, where the same model may produce different floating-point rounding results on each inference run [21][22]

In a blog post published in December 2025, SambaNova claimed that Groq’s low-precision inference exhibited statistically significant accuracy degradation compared to FP32 baselines on certain tasks [22]. Groq’s counterarguments include: tests at Argonne National Laboratory demonstrated that TruePoint achieved 185x throughput on SARS-CoV-2 drug discovery workloads while maintaining FP32-level result accuracy [21]. At present, comprehensive independent third-party verification of precision remains limited.

3.6 Multi-Chip Scaling: Software-Defined Tensor Streaming Multiprocessor

A single LPU chip’s 230 MB SRAM is far from sufficient to accommodate large models—Llama 3.1 70B in FP8 requires approximately 70 GB, necessitating roughly 140 LPU v1 chips in parallel. Groq’s second ISCA paper (2022) described a large-scale TSP network scaling solution [23]:

Topology: 2D torus network, with the compiler pre-scheduling inter-chip data flow
Routing: Deterministic routing, without conventional routers and arbitration
Flow Control: Compiler-managed producer-consumer model
Theoretical Scaling Limit: 10,440 TSPs, end-to-end system latency <3 µs [23]

4. Generational Evolution and Specification Comparison

4.1 Complete Generational Specification Table

Parameter	LPU v1 (TSP/GroqChip 1)	LPU v2 (4nm Transition)	Groq 3 LP30 (NVIDIA)
Process Node	GlobalFoundries 14nm [24]	Samsung 4nm [25]	Samsung SF4X [8]
Die Area	25 x 29 mm (725 mm²) [3]	Not disclosed	Not disclosed
Frequency	900 MHz [3]	Not disclosed	Not disclosed
Compute Density	>1 TOPS/mm² [3]	—	—
On-chip SRAM	230 MB [26]	~300-400 MB (est.)	512 MB [8]
SRAM Bandwidth	80 TB/s [26]	Not disclosed	150 TB/s [8]
External Memory	No HBM	No HBM	No HBM
INT8 Compute	750 TOPS [27]	—	—
FP16 Compute	188 TFLOPS [27]	—	—
FP8 Compute	—	—	1.2 PFLOPS [28]
Vector ALUs	5,120 [27]	—	—
Matrix Multiply	320x320 fused dot [3]	—	Enhanced version
TDP	~300W [29]	—	—
Determinism	Full [3]	Full	Full
Status	Mass production (2020-2024)	Transition	Q3 2026 shipment

Key generational evolution figures: SRAM capacity increased from 230 MB to 512 MB (2.2x); bandwidth rose from 80 TB/s to 150 TB/s (1.9x). While the absolute increase is modest, this is achieved against the backdrop of SRAM density being unable to scale as rapidly as DRAM—an SRAM bit cell requires 6 transistors, whereas DRAM requires only 1 transistor plus a capacitor—making a 2x capacity improvement per generation non-trivial.

4.2 Groq 3 LPX System Specifications

The core value proposition of Groq 3 lies in inference disaggregation—splitting the prefill (compute-intensive) and decode (bandwidth-intensive) phases of inference across different hardware.

flowchart LR
    USER["User Query"] --> P["Vera Rubin NVL72
72 x Rubin GPU
Prefill Phase
288 GB HBM4, 22 TB/s"]
    P -->|"Dynamo Orchestration Layer
Prefill → Decode Separation"| D["Groq 3 LPX Rack
256 x LP30
Decode Phase
128 GB SRAM, 40 PB/s"]
    D --> R["Low-Latency Token Output"]

LPX Rack Specification	Value
Number of LP30 Chips	256 (32 x 1U compute tray) [28]
Total On-chip SRAM	128 GB [28]
Aggregate SRAM Bandwidth	40 PB/s [28]
Total Compute (FP8)	315 PFLOPS [28]
chip-to-chip scaling bandwidth	640 TB/s [28]

NVIDIA claims that LPX + Vera Rubin NVL72 delivers 35x higher throughput-per-megawatt than Blackwell NVL72 on trillion-parameter models, targeting a token price of $45 per million tokens [8].

Subsequent chips in NVIDIA’s roadmap: LP35 (adds NVFP4 support, aligning with Rubin Ultra), LP40 (planned for Feynman architecture) [8].

4.3 Architectural Comparison with NVIDIA GPU

Comparison Dimension	Groq LP30	NVIDIA Rubin GPU
On-chip Storage	512 MB SRAM	~50 MB L2 Cache
Storage Speed	150 TB/s (on-chip)	22 TB/s (HBM4 off-chip)
Storage Capacity	512 MB per chip	288 GB HBM4
Latency Consistency	Fully deterministic (no cache misses)	Cache hierarchy non-deterministic
Applicable Phase	Decode-dedicated	Prefill + Decode general-purpose
Compiler	Static scheduling, zero runtime overhead	CUDA kernel dynamic scheduling

5. Performance Benchmarks and Energy Efficiency Analysis

5.1 Inference Latency and Throughput

Performance data for Groq LPU across various open-source models:

Model	Groq LPU	GPU Comparison	GPU Platform	Speedup	Source
Llama 2 70B	~300 tok/s	~30 tok/s	H100 cluster	~10x	[5]
Llama 3 70B	500-750 tok/s	10-40 tok/s	H100/H200	~15-50x	[30]
Gemma 7B	~814 tok/s	~100 tok/s	GPU	~8x	[32]
Mistral Large	~320 tok/s	~28 tok/s	A100	~11x	[33]
Mixtral 8x7B	~500 tok/s	~40 tok/s	H100	~12x	[34]
Phi-3	3,200 tok/s	~600 tok/s	H100 + vLLM	~5x	[35]
Llama 3 8B	~500-600 tok/s	~80 tok/s	H100	~7x	[34]

It is important to note that these data points originate from multiple sources and varying test conditions, and do not represent A/B testing under a unified benchmark. However, the overall trend is consistent: in single-user/low-batch (batch=1) scenarios, Groq LPU’s speed advantage is most pronounced (10-50x). As batch size increases, GPU utilization rises, narrowing the gap.

5.2 Latency Determinism

A key, often underestimated advantage of Groq is extremely low latency variance [4]:

Time To First Token (TTFT) ~0.22s, largely unaffected by system load
Latency variance per inference run for the same model under the same configuration <5%
GPU system latency variance under identical conditions can reach 30-50%, primarily due to HBM refresh cycles and cache contention

This characteristic is crucial for real-time interactive AI applications (voice assistants, Agentic AI).

5.3 Energy Efficiency

Metric	Groq LPU	GPU (H100)	Ratio
Joules/token	1-3 J [36]	10-30 J [36]	~10x
Energy/1M tokens	1-3 kWh	10-30 kWh	~10x
Single card power	~300W	700W	~0.4x
Single card price	~$20,000 [37]	~$28,000-38,000 [37]	~0.5-0.7x

Groq LPU’s energy efficiency advantage has three physical origins: (1) SRAM read energy is approximately 0.1-0.3 pJ/bit, whereas HBM reads (including TSV + SerDes) are about 5 pJ/bit, a 17-50x difference; (2) the deterministic architecture eliminates waste from speculative execution and cache misses; (3) lower overall TDP.

6. GroqCloud Platform and Pricing Model

GroqCloud launched in February 2024, offering an OpenAI-compatible API [18]. As of 2026, three deployment tiers are available:

Public Cloud (GroqCloud): Token-based billing, with Free/Developer/Enterprise rate tiers
Private/Dedicated Cloud: Custom capacity and key data residency
GroqRack On-Premise: For government, finance, and other regulated industries, air or liquid cooling [38]

6.1 Supported Models and Pricing

Groq supports only open-weight models; it does not host proprietary models such as GPT-5.5, Claude, or Gemini [39]:

Model	Input Price	Output Price
Llama 3.1 8B	$0.05 / M tokens	$0.08 / M tokens
Llama 3.3 70B	$0.59 / M tokens	$0.79 / M tokens
DeepSeek R1 Distill 70B	$0.75 / M	$0.99 / M
Whisper Turbo (Speech)	$0.04 / hour	—
GPT-OSS, Qwen3 32B, Kimi K2	Varies by model	—

Free tier rate limits: 30 RPM, 1K RPD for all models; Llama 70B: 12K TPM, 100K TPD [40]. Batch API offers a 50% discount (24h-7d processing window) [39].

7. Business Model and Financial Analysis

7.1 Revenue Forecast Evolution

Groq’s financial trajectory presents a case study of an AI hardware startup navigating the gap between “revenue illusion and reality”:

Early 2025: Forecast annual revenue of over $2B to investors [41]
July 2025: Sharply revised down to approximately $500M—a 75% cut within three months [41][42]

Primary reasons for the downgrade [43]: (1) Insufficient data center capacity—physical deployment speed of LPU clusters fell behind expectations; (2) Partial revenue from the Saudi agreement deferred to 2026 recognition; (3) Enterprise customer signing pace below forecasts; (4) Chip production yield and delivery cycle challenges.

7.2 Key Financial Metrics

Metric	Value	Conditions/Source
2025 Revenue Forecast	~$500M	Post-downgrade [41]
2023 Net Loss	-$88M	Public data [1]
2024 ARR	~$172M	Latka estimate [45]
GroqCloud Developers	~2 million	Company disclosure [46]
Key Customers	Bell Canada, Aramco Digital, Saudi Arabia [47]	Enterprise contracts
Customer Concentration Risk	Highly dependent on 1-2 Middle Eastern entities	Saudi Arabia accounts for majority of agreement [48]

7.3 Cumulative Funding

Groq raised approximately $1.87B in equity financing from 2017 to 2026, plus a $1.5B infrastructure commitment from Saudi Arabia, totaling approximately $3.37B. Adding the $20B NVIDIA deal consideration, the total valuation of Groq’s technology + talent + assets approximates $23B—though founders and investors received substantial returns, the commercial reality of the company’s independent operational side remained stark [44].

7.4 Saudi Arabia Agreement

The $1.5B commitment was the most critical non-equity funding source during Groq’s independent period [15]. The core of this agreement:

Infrastructure: Build a GroqCloud data center in Dammam—the largest AI inference hub in the EMEA region
Partner: Aramco Digital, providing inference capacity for its Norous voice AI and Allam bilingual model
Strategic Alignment: Aligns with Saudi Vision 2030’s AI economic diversification strategy

This also introduced significant risk: Groq’s lion’s share of revenue and Middle East expansion plans were highly dependent on sustained investment from Saudi Arabia [48].

8. Competitive Landscape Analysis

8.1 Inference Chip Panorama

The AI inference chip market in 2026 has formed a “three-way split” structure:

GPU General-Purpose: NVIDIA H100/B200, AMD MI300X—flexible, CUDA ecosystem, but low inference efficiency
ASIC Inference-Dedicated: Groq LPU, Cerebras WSE-3, SambaNova SN40L, Etched Sohu, MatX—high $ inference performance, 10-100x vs GPU
Hyperscaler Custom: Google TPU v7 (Ironwood, 4,614 TFLOPS/chip), AWS Inferentia, Meta MTIA (four-generation roadmap), Microsoft Maia—vertically integrated, locked-in workloads

8.2 Core Competitor Comparison

Dimension	Groq LPU	Cerebras WSE-3	NVIDIA H100/B200	SambaNova SN40L
Chip Form Factor	Single die ASIC	Wafer-scale (46,225 mm²)	Single die GPU	Multi-die reconfigurable
On-chip SRAM	230/512 MB	44 GB	~50 MB (L2)	Not disclosed
Memory Bandwidth	80/150 TB/s	21 PB/s	3.35 TB/s	Not disclosed
Supports Training	❌	✅	✅	✅
Determinism	Full	Wafer-scale	No	Partial
FP8 Compute	1.2 PFLOPS (v3)	125 PFLOPS	4.5 PFLOPS (B200)	—
Max Model on Single Chip	~1-7B (FP8)	~100B+	~70B	—
Compiler Model	Static scheduling	Wafer mapping	CUDA kernel	Dataflow mapping
Business Model	Inference API + Cloud	Training+Inference Cloud+On-prem	Full stack	Training+Inference

8.3 Cerebras: The Most Direct Comparison

Cerebras WSE-3 shares an SRAM-centric design philosophy with Groq, but differs significantly in scale and capability:

Cerebras Advantages: 44 GB on-chip SRAM vs 230/512 MB—single chip can accommodate 100B+ parameter models without cross-chip data movement; supports both training and inference; customer G42, $20B chip procurement agreement with OpenAI [50]
Groq Advantages: More mature software stack (OpenAI-compatible API), larger developer community (2M vs Cerebras ~100K), lower Time To First Token (TTFT 0.22s vs ~0.5s)

8.4 Industry Consolidation Wave

A series of acquisitions during 2025-2026 indicates that independent inference chip startups face a binary choice of “be acquired vs. be marginalized” [8]:

Groq → NVIDIA ($20B, 2025.12)
Untether AI → AMD acquires engineering team (2025.06)
Enfabrica → NVIDIA acquires (~$900M, 2025.09)
Rivos → Meta acquires (2025.10)
SambaNova → Intel acquisition talks fail; instead, $350M investment + partnership (2026.02)
Cerebras → Remains independent, Nasdaq IPO May 2026, valuation $95B

9. NVIDIA Acquisition

9.1 Deal Structure

Agreement reached on December 24, 2025 [6][7]:

Element	Detail
Consideration	Approximately $20B (all cash)
Nature	Non-exclusive technology license + talent acquisition [6]
NVIDIA Acquires	All Groq patents + software stack + founding team + core engineering
Joining NVIDIA	Jonathan Ross (CEO), Sunny Madra (President), architecture/compiler/systems teams
Groq Independent Entity	Continues operations, retains LPU production and GroqCloud. CEO: Simon Edwards (ex-CFO) → Adam Winter
Largest NVIDIA Deal	Previous record: Mellanox ~$6.9B (2019)

9.2 Five-Layer Strategic Analysis

1. Technology Layer: GPUs are inherently inefficient at autoregressive decode. Each token generation requires reading the full set of weights from HBM (approximately 140 GB for a 70B model); the GPU’s compute-to-bandwidth ratio is severely imbalanced in this scenario. Groq LPU’s 150 TB/s SRAM bandwidth raises this bottleneck 7x from 22 TB/s (HBM4), with fully predictable latency.

2. Product Layer: NVIDIA had originally planned to launch Rubin CPX—an inference accelerator based on GDDR7. However, CPX’s assumed ~2 TB/s GDDR7 bandwidth was utterly uncompetitive against Groq LPU’s 150 TB/s SRAM. Following GTC 2026, CPX has disappeared from NVIDIA’s roadmap [8].

3. Competitive Layer: Had Groq remained independent and been acquired by the Cerebras/OpenAI camp, it would have formed a second inference hardware ecosystem outside of GPU—posing a serious threat to NVIDIA’s long-term market dominance.

4. Platform Layer: The InfiniBand/SmartNIC technology from the 2019 Mellanox acquisition ($6.9B) was later integrated into NVLink and NVSwitch, becoming an inseparable part of NVIDIA’s AI infrastructure. Groq’s deterministic compiler and LPU architecture will follow the same path.

5. Financial Layer: Acquiring proven technology + team for 10% of NVIDIA’s ~$200B annual revenue is efficient in both time and cost compared to internal development (3-5 years + tens of billions in R&D investment).

9.3 Market Reaction

Bernstein analyst Stacy Rasgon noted: the deal’s non-exclusive licensing structure (Groq can license to companies other than NVIDIA) “may keep the fiction of competition alive, while effectively neutralizing a competitor” [52]. This structure also made the deal easier to navigate antitrust review.

AWS has already announced deployment of Groq 3 LPU + over one million NVIDIA GPUs as an extended collaboration at GTC 2026 [8].

10. Limitations, Controversies, and Unresolved Issues

10.1 SRAM Capacity Ceiling

SRAM’s physical density is approximately 1/100 to 1/200 that of DRAM—because an SRAM bit cell typically requires 6 transistors, while DRAM requires only 1 + 1 capacitor. Even with Groq 3 doubling SRAM to 512 MB, there remains a 160-560x capacity gap compared to GPU’s 80-288 GB HBM.

Practical consequences:

Llama 3.1 70B (FP8) requires approximately 60 LP30 chips in parallel
Trillion-parameter models: LPX rack’s 128 GB SRAM can only hold roughly 1/10 of the model, requiring external DDR5 supplementation
Inter-chip communication overhead grows non-linearly with chip count

10.2 No Training Support

The LPU’s determinism and static scheduling make it unusable for model training [53]. Training requires: (1) dynamic backpropagation; (2) gradient accumulation; (3) iterative weight updates—all of which the LPU’s deterministic architecture cannot process. Cerebras WSE-3 holds a significant advantage here: single-chip support for training + inference.

10.3 TruePoint Precision Controversy

In a December 2025 technical blog post, SambaNova claimed that Groq’s low-precision inference exhibited statistically significant accuracy degradation across multiple NLU tasks [22]. Independent third-party verification remains insufficient.

10.4 Ecosystem Limitations

Open-weight models only: Cannot run GPT, Claude, Gemini [39]
Shallow model selection: ~10 hosted models vs AWS’s hundreds
No LoRA/adapter support: Cannot fine-tune or customize
Immature toolchain: Relative to CUDA ecosystem’s 15 years of accumulation

10.5 Independent Entity

Following the NVIDIA deal, the Groq independent entity (raised $650M in May 2026) faces contradictions:

Lost core IP exclusivity: Licensed to NVIDIA
Lost founding team: Ross + Madra + core engineering at NVIDIA
Faces intense competition: Cerebras, Fireworks, Together, OpenRouter all vying for the inference cloud market
But retains key assets: 2 million developers, GroqCloud API, ability to license to other parties if necessary

11. Summary and Future Outlook

11.1 Core Contributions

Groq’s LPU has left three distributed contributions in AI hardware history:

Deterministic Architecture as a Third Path: Beyond GPU (flexible, non-deterministic) and fixed-function ASIC (non-programmable), proved that “programmable + fully deterministic” is a viable path
SRAM-centric Inference Paradigm: Demonstrated the feasibility of on-chip SRAM as primary weight storage (rather than cache), providing a reference for subsequent SRAM-based inference designs
Compiler-Hardware Co-design: Elevated the compiler to a first-class citizen in hardware scheduling, with the compiler possessing complete control over architectural state—this philosophy has influenced Etched, MatX, D-Matrix, and other companies

11.2 Key Insights

Groq’s story—from obscurity in 2016, to viral fame in 2024, to NVIDIA’s $20B acquisition at the end of 2025—reveals several trends in the AI inference hardware market:

The optimal exit for independent inference chip startups is increasingly becoming acquisition by a major platform
Inference disaggregation (prefill + decode assigned to different hardware) will become an industry standard
The physical limits of SRAM capacity mean SRAM-centric designs are suited for specific market segments (ultra-low-latency inference), not as a universal replacement

11.3 Outlook (2026-2030)

2026 Q3: Groq 3 LPX ships, inference disaggregation formally enters productization
2027: LP35 + NVFP4, Rubin Ultra compatibility; AI inference market size approaches $200B [49]; hyperscaler custom + ASIC + GPU three-way split
2028+: LP40 with Feynman architecture; inference ASIC market share projected to grow from ~5% in 2025 to ~20-30% by 2030

References

Groq - Wikipedia. https://en.wikipedia.org/wiki/Groq
Williams, W. (Feb 2024). “Groq’s ultrafast LPU could well be the first LLM-native processor”. TechRadar Pro.
Abts, D.; Ross, J.; et al. (May 2020). “Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads”. ISCA 2020. doi:10.1109/ISCA45697.2020.00023
Groq (Feb 2024). “Groq LPU Inference Engine Crushes First Public LLM Benchmark”. groq.com/blog.
Introl Blog (Jan 2026). “Groq LPU Infrastructure: Ultra-Low Latency AI Inference Guide 2025”.
Nellis, S. (Dec 2025). “Nvidia, joining Big Tech deal spree, to license Groq technology, hire executives”. Reuters.
Silberling, A. (Dec 2025). “Nvidia to license AI chip challenger Groq’s tech and hire its CEO”. TechCrunch.
James, L. (Mar 2026). “How Nvidia’s $20 billion Groq 3 LPU deal reshapes the Nvidia Vera Rubin Platform”. Tom’s Hardware.
LinkedIn. “Jonathan Ross — Chief Software Architect @ Nvidia”. linkedin.com/in/ross-jonathan.
MSN. “Jonathan Ross net worth after $20 billion Nvidia deal”.
Clark, K. (Sep 2018). “Secretive semiconductor startup Groq raises $52M from Social Capital”. TechCrunch.
King, I. (Apr 2021). “Tiger Global, D1 Lead $300 Million Round in AI Chip Startup Groq”. Bloomberg.
Wiggers, K. (Aug 2024). “AI chip startup Groq lands $640M to challenge Nvidia”. TechCrunch.
Reuters (Sep 2025). “Groq more than doubles valuation to $6.9 billion”.
Groq (Feb 2025). “Saudi Arabia Announces $1.5 Billion Expansion to Fuel AI-powered Economy with Groq”.
Silberling, A. (May 2026). “After Nvidia’s $20B not-acqui-hire, AI chip startup Groq reportedly raising $650M”. TechCrunch.
Groq (Mar 2022). “Groq Acquires Dataflow Systems Pioneer Maxeler Technologies”. PRNewswire.
TechCrunch (Mar 2024). “AI chip startup Groq forms new business unit, acquires Definitive Intelligence”.
Williams, W. (Feb 2024). “‘Feels like magic!’ — Groq’s ultrafast LPU”. TechRadar Pro.
Upadhyay, A. (Mar 2024). “The Architecture of Groq’s LPU”. Coding Confessions Blog.
Groq. “TruePoint Technology — Stop Compromising Accuracy for Performance”. GroqDocs.
SambaNova (Dec 2025). “Does reduced precision hurt? A bit about losing bits”. sambanova.ai/blog.
Abts, D.; Kimmell, G.; et al. (Jun 2022). “A software-defined tensor streaming multiprocessor for large-scale ML”. ISCA 2022. doi:10.1145/3470496.3527405
Ward-Foxton, S. (Jan 2020). “Groq’s AI Chip Debuts in the Cloud”. EETimes.
Hwang, J-S. (Aug 2023). “Samsung’s new US chip fab wins first foundry order from Groq”. Korea Economic Daily.
Groq (Aug 2025). “Inside the LPU: Deconstructing Groq’s Speed”. groq.com/blog.
GroqChip Processor Product Brief v1.7 (PDF). groq.sa.
The Register (Mar 2026). “A closer look at Nvidia’s Groq-powered LPX rack systems”.
Awesome Agents (Mar 2026). “Groq LPU — Deterministic Inference at Scale”.
Silicon Analysts (Dec 2024). “Nvidia vs Groq: The Inference Acceleration Battle”.
Markaicode (Jun 2026). “Groq Mixtral H100 Throughput: 480 tok/s on Llama 3 70B”.
Groq (Mar 2024). “Groundbreaking Gemma 7B Performance running on the Groq LPU Inference Engine”.
Markaicode. “Mistral Large on A100 vs Groq LPU: VRAM Benchmark”.
Beebom. “Meet Groq, a Lightning Fast AI Accelerator that Beats ChatGPT & Gemini”.
Markaicode. “Groq vs vLLM on H100: Phi-3 Throughput Hits 3,200 Tokens/Sec”.
Li, Z. “Groq’s Deterministic Architecture is Rewriting the Physics of AI Inference”. Medium.
Silicon Analysts. “Nvidia vs Groq: Cost Analysis Hardware Pricing”.
Groq. “GroqRack Compute Cluster”. groq.com/groqrack.
Groq On-Demand Pricing. https://groq.com/pricing
TokenMix Blog (Apr 2026). “Groq API Pricing 2026: Free Tier, $0.05/M Paid Models”.
The Information (Jul 2025). “Groq slashes 2025 revenue projections to $500 million”.
Investing.com (Jul 2025). “Groq slashes 2025 revenue projections to $500 million”.
TrendForce (Jul 2025). “Groq Cuts 2025 Revenue Projection by USD 1.5B”.
Sacra. “Groq revenue, valuation & funding”. sacra.com/c/groq/.
Latka. “Groq Revenue 2025: $172.5M ARR”. getlatka.com.
Sacra (Feb 2026). “Equity Research Groq” (PDF).
Data Center Dynamics (Feb 2025). “Groq secures $1.5bn from Saudi Arabia”.
Reuters (Jul 2025). “AI chip startup Groq discusses $6 billion valuation”.
Silicon Analysts (Apr 2026). “AMD vs NVIDIA AI GPU Market Share 2026”.
Digitimes (Apr 2026). “Nvidia and OpenAI both make US$20 billion bets on AI chip startups”.
TrendForce (2026). “Custom ASIC shipments from cloud providers growing 44.6% in 2026”.
CNBC (Dec 2025). “Nvidia-Groq deal is structured to keep ‘fiction of competition alive,’ analyst says”.
Cryptonomist (Apr 2026). “NVIDIA pairs Rubin GPUs with Groq LPU to cut latency, boost inference”.

1. Overview#

2. Historical Evolution and Founding Background#

2.1 Founder and TPU#

2.2 Funding History#

2.3 Key Acquisitions#

2.4 Early Strategy: The Unexpected Pivot from CNN to LLM#

3. In-Depth Technical Architecture Analysis#

3.1 Design Philosophy: Determinism First#

3.2 TSP Functionally-Sliced Microarchitecture#

3.3 Streaming Execution Model#

3.4 Compiler and ISA#

3.5 TruePoint Numerical Precision#

3.6 Multi-Chip Scaling: Software-Defined Tensor Streaming Multiprocessor#

4. Generational Evolution and Specification Comparison#

4.1 Complete Generational Specification Table#

4.2 Groq 3 LPX System Specifications#

4.3 Architectural Comparison with NVIDIA GPU#

5. Performance Benchmarks and Energy Efficiency Analysis#

5.1 Inference Latency and Throughput#

5.2 Latency Determinism#

5.3 Energy Efficiency#

6. GroqCloud Platform and Pricing Model#

6.1 Supported Models and Pricing#

7. Business Model and Financial Analysis#

7.1 Revenue Forecast Evolution#

7.2 Key Financial Metrics#

7.3 Cumulative Funding#

7.4 Saudi Arabia Agreement#

8. Competitive Landscape Analysis#

8.1 Inference Chip Panorama#

8.2 Core Competitor Comparison#

8.3 Cerebras: The Most Direct Comparison#

8.4 Industry Consolidation Wave#

9. NVIDIA Acquisition#

9.1 Deal Structure#

9.2 Five-Layer Strategic Analysis#

9.3 Market Reaction#

10. Limitations, Controversies, and Unresolved Issues#

10.1 SRAM Capacity Ceiling#

10.2 No Training Support#

10.3 TruePoint Precision Controversy#

10.4 Ecosystem Limitations#

10.5 Independent Entity#

11. Summary and Future Outlook#

11.1 Core Contributions#

11.2 Key Insights#

11.3 Outlook (2026-2030)#

References#