Rock Zhang

torch.compile

1. Overview

torch.compile is a just-in-time (JIT) compilation framework introduced in PyTorch 2.0 (released March 2023). It marks PyTorch’s shift from pure eager-mode execution to compiler-optimized execution. Its core design goal is to speed up model execution, typically by 1.5-2x, through automatic graph capture and kernel code generation, while preserving PyTorch’s Python programmability and debugging flexibility [1][2]. torch.compile is driven by a pipeline of three core components: TorchDynamo (a graph-capture frontend that operates at the Python bytecode level), AOTAutograd (which generates the backward graph ahead of time for training scenarios), and TorchInductor (the default optimizing backend, which generates Triton kernels for GPUs and C++/OpenMP kernels for CPUs). An August 2025 report by Edward Yang (ezyang, a core member of the PyTorch compiler team) indicates that a 1.5-2x speedup is typical in common scenarios, and that torch.compile enables global optimizations such as automatic activation checkpointing and asynchronous tensor parallelism [1][3]. ...
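
As a minimal sketch of the workflow described above (not taken from the cited sources), the following wraps a small function with torch.compile and checks it against eager execution. The function gelu_mlp and the tensor shapes are hypothetical, and a PyTorch 2.x installation is assumed. The sketch passes backend="aot_eager", a debugging backend that exercises TorchDynamo capture and AOTAutograd tracing but skips TorchInductor code generation, so it should run without a C++ or Triton toolchain; omitting the argument selects the default TorchInductor backend.

```python
import torch


def gelu_mlp(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Hypothetical example function: a matmul followed by pointwise ops
    # and a reduction, the kind of pattern a compiler backend could fuse.
    return torch.nn.functional.gelu(x @ w).sum(dim=-1)


# "aot_eager" runs graph capture (TorchDynamo) and forward/backward tracing
# (AOTAutograd), then executes the captured graphs with regular eager kernels.
# Dropping `backend=` would use the default backend, TorchInductor.
compiled_mlp = torch.compile(gelu_mlp, backend="aot_eager")

torch.manual_seed(0)
x = torch.randn(8, 16, requires_grad=True)
w = torch.randn(16, 4)

eager_out = gelu_mlp(x, w)
compiled_out = compiled_mlp(x, w)  # the first call triggers compilation

# The compiled function should agree numerically with eager execution.
outputs_match = torch.allclose(eager_out, compiled_out, atol=1e-5)

# The backward pass runs through the ahead-of-time traced backward graph.
compiled_out.sum().backward()
grad_shape = tuple(x.grad.shape)
```

The first call pays the compilation cost; later calls with the same input shapes and dtypes are expected to reuse the compiled artifact. Any speedup depends on the model, hardware, and backend, so this sketch only checks correctness, not performance.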

Groq and LPU

1. Overview

Groq, Inc. is an AI chip company headquartered in Mountain View, California, founded in 2016 by Jonathan Ross, formerly a core designer of Google’s TPU [1]. The company’s processor architecture was initially called the Tensor Streaming Processor (TSP) and was rebranded as the Language Processing Unit (LPU) during the large language model wave of 2023-2024 [2]. Groq’s core philosophy rests on a radical and elegant design choice: discard the non-deterministic hardware mechanisms that the computing industry has accumulated over forty years, and hand execution scheduling entirely to the compiler [3]. In traditional CPUs and GPUs, cache hierarchies, branch prediction, out-of-order execution, and dynamic scheduling are core mechanisms for boosting average performance, but they also make latency unpredictable. Groq’s design team realized that for inference workloads, whose computation graphs are known and fixed before execution, these mechanisms are not merely superfluous but actively harmful. ...

Cerebras

1. Overview

Wafer-Scale Integration (WSI) is not an original concept of Cerebras. In 1980, Gene Amdahl, the father of the IBM mainframe, founded Trilogy Systems in an attempt to manufacture an entire wafer as a single processor. Trilogy raised $230 million from entities including IBM and Sperry Rand, the largest startup financing in Silicon Valley history at the time. During prototype testing, however, the entire wafer short-circuited upon power-up and burned to a dim red glow, the metal wiring layers delaminated, and the thermal solution failed completely. The company also suffered a devastating fab flood and the sudden death of its president, and Amdahl himself was seriously injured in a car accident. Trilogy ended in total failure five years after its founding. In the same period, Texas Instruments, ITT, and the U.S. National Security Agency (NSA) all attempted the WSI route, and they reached the same conclusion: manufacturing a commercial wafer-scale chip would require a 99.99% fabrication yield, a level considered at the time to be unachievable for at least another 100 years. ...

HBM

1. Overview

High Bandwidth Memory (HBM) is a computer memory interface technology for 3D-stacked synchronous dynamic random-access memory (SDRAM), jointly developed by Samsung, AMD, and SK Hynix. In October 2013, JEDEC adopted HBM as an industry standard (JESD235). The technology stacks multiple DRAM dies vertically, connects the layers with Through-Silicon Vias (TSVs) and microbumps, and couples the stack tightly to a GPU or accelerator through a silicon interposer. The result is far higher data transfer bandwidth than traditional DDR/GDDR solutions, in a much smaller footprint and at lower power consumption. ...
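
The bandwidth advantage follows from simple arithmetic: peak bandwidth is the interface width multiplied by the per-pin data rate. The sketch below works this out under stated assumptions that are not in the text above: a first-generation HBM stack with a 1024-bit interface at 1 Gb/s per pin, and a DDR4-3200 channel with a 64-bit interface at 3.2 Gb/s per pin. The function name peak_bandwidth_gb_s is hypothetical, and the figures are nominal peaks rather than measured throughput.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float, devices: int = 1) -> float:
    """Nominal peak bandwidth in GB/s: width (bits) * rate (Gb/s per pin) / 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8 * devices


# Assumed nominal figures (not from the text above):
hbm1_stack = peak_bandwidth_gb_s(1024, 1.0)                 # one first-generation HBM stack
hbm1_four_stacks = peak_bandwidth_gb_s(1024, 1.0, devices=4)  # four stacks on one interposer
ddr4_channel = peak_bandwidth_gb_s(64, 3.2)                 # one DDR4-3200 channel

print(f"HBM (gen 1), one stack:   {hbm1_stack:.1f} GB/s")
print(f"HBM (gen 1), four stacks: {hbm1_four_stacks:.1f} GB/s")
print(f"DDR4-3200, one channel:   {ddr4_channel:.1f} GB/s")
print(f"Four stacks vs one channel: {hbm1_four_stacks / ddr4_channel:.0f}x")
```

The design point this illustrates is that HBM gets its bandwidth from a very wide interface running at a modest per-pin rate, which is only practical because the interposer provides thousands of short connections between the stack and the processor.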

NVIDIA SXM

1. SXM Overview

SXM stands for Server PCI Express Module. It is NVIDIA’s proprietary high-bandwidth GPU socket/connector solution, designed for mounting data-center-class GPU accelerators directly onto server motherboards.

Core Design Philosophy

Proprietary: SXM is NVIDIA’s closed interface standard. Its specifications are undisclosed and available only under a Non-Disclosure Agreement (NDA), which gives NVIDIA complete design freedom.
High Bandwidth: GPUs connect directly to one another via NVLink, with bandwidth far exceeding PCIe.
High Power: SXM is not bound by the PCIe standard’s 75W/300W limits; power is delivered directly through the socket, up to 700W-1400W+ ...