torch.compile

1. Overview

torch.compile is a Just-In-Time (JIT) compilation framework introduced in PyTorch 2.0 (released March 2023), marking PyTorch’s transition from pure eager-mode execution to compilation-optimized execution. Its design goal is to speed up model execution, typically by 1.5-2x, through automatic graph capture and kernel code generation, while preserving PyTorch’s Python programmability and debugging flexibility [1][2].

torch.compile is driven by a pipeline composed of three core components: TorchDynamo (Python bytecode-level graph capture frontend), AOTAutograd (backward graph pre-generation for training scenarios), and TorchInductor (the default optimization backend, generating Triton GPU kernels or C++/OpenMP CPU kernels). As of August 2025, a report by Edward Yang (ezyang, core member of the PyTorch compiler team) indicates that 1.5-2x acceleration is the typical performance observed in common scenarios, and torch.compile enables global-level optimizations such as automatic activation checkpointing and asynchronous tensor parallelism [1][3].

As of June 2026, the current stable release is PyTorch 2.12, and torch.compile supports Python 3.13, torch.compiler.set_stance for fine-grained performance control, and more [4][5]. In production, vLLM has enabled torch.compile by default as a core inference engine component since the V1 architecture [6]; the vast majority of models in the Hugging Face and TIMM model suites achieve substantial acceleration when torch.compile is enabled [7]. Third-party research such as GraphMend further demonstrates that automatically eliminating graph breaks can reduce model inference latency by up to 75% [8].

2. Product Evolution: From TorchScript to the PT2 Compiler Stack

2.1 Predecessor: Lessons from TorchScript

PyTorch introduced TorchScript in version 1.0 (2018), attempting to capture Python models as static graphs for optimized execution via torch.jit.trace and torch.jit.script. However, TorchScript had fundamental limitations: trace recorded only the operations executed for one example input, so it could not represent data-dependent control flow; script could handle control flow, but it required user code to adhere to a strict Python subset (no support for most dynamic Python features, many third-party libraries, NumPy interaction, etc.). As a result, production adoption of TorchScript stayed limited, with the user community reporting it as “too fragile and difficult to debug” [9][10].

2.2 PyTorch 2.0 (March 2023)

PyTorch 2.0 marked the debut of torch.compile. Its core innovation was TorchDynamo: a Python-level JIT compiler based on the CPython frame evaluation API, capable of observing Python bytecode execution at runtime, automatically extracting tensor computation regions into FX graphs, and performing graceful “graph breaks” (fallback) for uncapturable code segments [1][2][11]. Compared to TorchScript, users only need to add the @torch.compile() decorator to a model/function to obtain acceleration, without modifying the model definition.

PyTorch’s officially published TorchBench benchmarks showed a geometric mean speedup of 1.8-2x across 80+ models for version 2.0 [1][7]. However, the initial 2.0 release had pain points such as long compilation cold start times and imperfect dynamic shape support.

2.3 PyTorch 2.1-2.3 (2023-2024)

2.1 (October 2023): Introduced automatic dynamic shape support (the default dynamic=None behavior, which detects shape changes and attempts dynamic recompilation) and native support for NumPy programs: torch.compile can understand and compile NumPy code, supporting NumPy execution on CPU and CUDA as well as gradient backpropagation [12].
2.2 (January 2024): Further dynamic shape improvements; AOTAutograd min-cut partitioning optimization [13].
2.3 (April 2024): Supported integration of user-defined Triton kernels with torch.compile; PrimTorch operator set normalization [14].

2.4 PyTorch 2.4-2.6 (2024-Early 2025)

2.4 (July 2024): Supported torch.compile under Python 3.12; AOTInductor freezing optimization (MKLDNN weight serialization) [15].
2.5 (October 2024): Regional compilation reduced compilation cold start from 67 seconds to 9.6 seconds (7x improvement); introduced FlexAttention (automatically compiles custom attention mask functions into FlashAttention-level Triton kernels via torch.compile) [16][17].
2.6 (January 2025): Supported Python 3.13; introduced torch.compiler.set_stance performance control interface; AOTInductor supported FP16 x86 CPU; Inductor CUDA Graphs enabled by default [4][5].

2.5 PyTorch 2.7-2.12 (2025-2026)

2.7: CUDA Graph Trees (multi-graph shared memory pool); Inductor supported more operator fusion patterns.
2.8 (August 2025): Numerous vLLM-related upstream improvements; compiler cache sharding by model hash; backed_size_oblivious dynamic shape mode (reducing unnecessary recompilations caused by 0/1 specialization) [6][18].
2.9-2.11 (Late 2025 - Early 2026): Further compilation cold start optimization; Helion project entered Beta (providing a Triton kernel programming layer with native PyTorch interfaces); torch.export stabilization [19].
2.12 (Current stable release, June 2026): Ongoing Dynamo tracing and guard system improvements; smarter recompilation detection; deeper integration with vLLM and NVIDIA AITune.

2.6 Comparison of Key Features Across Versions

Feature	PyTorch 2.0	PyTorch 2.5	PyTorch 2.8	PyTorch 2.12
Release Date	2023-03	2024-10	2025-08	2026-06
Graph Capture Engine	TorchDynamo	Dynamo + Regional Compilation	Dynamo + Improved ephemeral tracing	Dynamo + backed_size_oblivious mode
Default Backend	TorchInductor	Inductor + Triton	Inductor + CUDA Graph Trees	Inductor + Helion Beta
Dynamic Shape Support	Experimental, manual	Default dynamic=None auto-detection	backed/unbacked dual mode	Three modes: backed/unbacked/backed_size_oblivious
Compilation Cold Start	~67 sec (full model)	~9.6 sec (regional compilation)	~5 sec (cache sharding)	~3-4 sec (warm cache)
Python Support	3.8-3.11	3.9-3.12	3.10-3.13	3.10-3.14
Training Compilation	AOTAutograd min-cut	AOTAutograd + Auto AC	compiled autograd	compiled autograd + AOTInductor training
Attention Mechanism	Standard SDPA	FlexAttention Beta	FlexAttention + chunked attention	FlexAttention + context parallelism
Production Adoption	Experimental	Hugging Face + TIMM	vLLM enabled by default	vLLM + TensorRT + AITune integration

3. Technical Architecture

The overall pipeline of torch.compile follows a “layered compiler” design: progressively lowering from high-level Python code to hardware-level executable code, with each layer solving compiler problems at different granularities [20][21].

3.1 Pipeline Overview

Python Model Code
    │
    ▼
[TorchDynamo] ─── Python Bytecode-Level Graph Capture
    │               Captures tensor computation regions as FX graphs
    │               Handles graph breaks
    │               Installs guards (runtime assumption checks)
    ▼
[AOTAutograd] ─── Joint Forward + Backward Graph Processing
    │               Activated only during training
    │               min-cut partitioning to minimize saved tensors
    │               Generates autograd.Function wrapper
    ▼
[TorchInductor] ── Optimization and Lowering
    │               Operator fusion (pointwise + reduction)
    │               Layout optimization and memory planning
    │               matmul backend selection (cuBLAS/Triton/CUTLASS)
    │               CUDA Graphs automatic partitioning
    ▼
[Triton/C++/OpenMP] ── Kernel Code Generation
    │               GPU: Triton source → TTIR → TTGIR → LLVM IR → PTX/SASS
    │               CPU: C++ + OpenMP code generation
    ▼
[GPU/CPU Runtime] ── Execution

3.2 TorchDynamo: Python Bytecode-Level Graph Capture

TorchDynamo is the entry point of the entire pipeline and the most fundamental difference between torch.compile and TorchScript. It does not require users to provide static source code; instead, it observes Python execution at runtime by installing a custom frame evaluation function through CPython’s frame evaluation API (PEP 523), which intercepts frames before they reach the default evaluator (_PyEval_EvalFrameDefault) [2][20].

Mechanism:

When a function decorated with @torch.compile is called, Dynamo takes over its frame execution
Dynamo inspects bytecode instructions one by one, extracting sequences involving PyTorch tensor operations into a torch.fx.Graph representation
For untraceable operations (calling non-PyTorch C extensions, I/O operations, data-dependent control flow), Dynamo performs a “graph break” at the current point—submitting the traced graph to the backend for compilation, running the break point in eager mode, and then continuing to trace a new graph
Each compiled graph comes with a set of guards (runtime assumption checks) that verify whether the assumptions made during compilation (input shape, dtype, device, global variables, etc.) still hold in subsequent calls. If a guard fails, Dynamo recompiles for the new conditions [2][22]

Key Design Advantages:

Dynamo’s bytecode-level tracing means it can handle arbitrary Python code—including control flow, exception handling, side effects—without requiring the code to follow a restricted Python subset. This is the core difference that sets it apart from TorchScript and JAX’s jax.jit (which requires functionally pure code) [2][23].

3.3 AOTAutograd: Pre-Generated Backward Graph

In training scenarios, compiling only the forward graph is insufficient. AOTAutograd is responsible for generating the corresponding backward graph segment from each forward graph segment captured by Dynamo [3][13].

Processing Flow:

Dynamo splits the forward function into several segments, generating an FX graph for each
AOTAutograd applies the autograd mechanism to each forward graph segment, deriving the corresponding backward graph
The min-cut partitioning algorithm is applied to each forward-backward graph pair, determining which intermediate activations must be saved (cannot be recomputed) and which can be recomputed to save memory [3]
Forward-backward pairs are packaged into autograd.Function modules
When the user calls .backward(), the eager mode autograd engine invokes these compiled backward graphs as if calling an atomic op

Important Limitation: Because PyTorch eager autograd does not support incrementally streaming gradients from large backward nodes, gradient updates are deferred and applied all at once at the end of the compiled region. This can be resolved through compiled autograd, but that requires the entire backward process to be compilable [3][13].

3.4 TorchInductor: Optimization and Kernel Code Generation

Inductor is the default backend for torch.compile, receiving FX graphs from Dynamo/AOTAutograd and executing a series of optimization and lowering steps to ultimately generate executable kernel code [20][21].

Core Optimization Capabilities:

Pointwise + Reduction Fusion: Fuses consecutive element-wise operations and reduction operations into a single Triton kernel, eliminating intermediate memory reads/writes. For example, a chain such as add → relu → sum can run as one kernel instead of three; pointwise operations that follow a matmul (matmul → add → relu) can likewise be fused as an epilogue when a Triton matmul template is selected [20].
Horizontal Fusion: Merges multiple independent pointwise or reduction operations (when shapes are compatible) for co-scheduling [24].
Matmul Backend Autotuning: In max-autotune mode, Inductor benchmarks candidate implementations (cuBLAS, Triton templates, and CUTLASS) for each matrix multiplication configuration and selects the fastest [24][25]. vLLM testing shows that for matrix multiplications of shape 8x2048x3072, Triton templates are much faster than the default cuBLAS dispatch [6].
Layout Optimization: Analyzes data dependencies to select appropriate tensor memory layouts, reducing transpose and copy operations.
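
To make the benefit of pointwise + reduction fusion concrete, here is a deliberately simplified pure-Python sketch. It is not Inductor code: the functions unfused and fused are hypothetical and only contrast three passes with materialized intermediates against a single pass.

```python
def unfused(xs, bias):
    """Three separate passes, each materializing an intermediate list."""
    added = [x + bias for x in xs]            # pass 1: pointwise add
    activated = [max(a, 0.0) for a in added]  # pass 2: pointwise relu
    return sum(activated)                     # pass 3: reduction


def fused(xs, bias):
    """One pass: add, relu and the running sum happen per element."""
    total = 0.0
    for x in xs:
        total += max(x + bias, 0.0)
    return total


data = [-2.0, -0.5, 0.25, 3.0]
assert unfused(data, 0.5) == fused(data, 0.5) == 4.25
```

On a GPU the intermediate lists correspond to tensors written to and read back from global memory, which is the traffic a fused kernel avoids.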

GPU Path: By default, Inductor emits Triton source for the fused computation graph, which the Triton compiler lowers through TTIR (Triton IR) → TTGIR (Triton GPU IR) → LLVM IR → PTX/SASS.

CPU Path: Inductor can generate C++ code and parallelize it via OpenMP. PyTorch 2.6 further introduced AOTInductor FP16 support for x86 CPUs [4].

3.5 CUDA Graphs Integration

CUDA Graphs is a low-level technology provided by NVIDIA that can record a series of GPU kernel launches (and their exact memory addresses) as a cudaGraph_t, then replay them with extremely low CPU overhead. However, CUDA Graphs has strict constraints: it must contain only CUDA operations, input tensors must have static memory addresses, and there can be no CPU-side computation [26][27].

Inductor has built-in automatic CUDA Graphs support:

Automatically partitions computation graphs into CUDA Graphs compatible and incompatible segments (e.g., CPU-side logic in attention operations that cannot be captured)
Automatically manages static input buffers
Supports CUDA Graph Trees (multiple graphs sharing a single memory pool, avoiding cross-graph memory fragmentation) [26][28]

vLLM uses Piecewise CUDA Graphs, capturing only the computation segments between attention operations (typically token-wise operations) as graphs, with the attention operations themselves running in eager mode. This approach retains attention flexibility while gaining the low-overhead benefits of CUDA Graphs [6].

4. Performance and Benchmarks

4.1 Official TorchBench Benchmarks

PyTorch’s officially maintained TorchBench benchmark suite, tested across 80+ models, shows a geometric mean speedup of 1.8-2x for torch.compile (compared to eager mode) [1][7]. The speedup distribution by model type is as follows:

Model Category	Representative Models	Speedup	Primary Acceleration Reasons
CNN Vision Models	ResNet-50, EfficientNet	1.3-1.8x	Convolution kernel fusion, vertical operator fusion
Transformer NLP	BERT, RoBERTa	1.5-2.2x	attention + MLP fusion, matmul autotune
LLM (Generative)	GPT-2, LLaMA	1.4-2.0x	KV cache + CUDA Graphs piecewise capture
Image Segmentation	Mask R-CNN	1.2-1.5x	Reduced fusion under compound loss branches
Speech	Wav2Vec2, Whisper	1.3-1.7x	Convolution + self-attention hybrid fusion
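
Benchmark suites report a geometric mean because speedups are ratios, and a geometric mean is not dominated by a few large outliers. The sketch below uses made-up per-model speedups (the model names and values are hypothetical, not TorchBench results) to show the calculation.

```python
from statistics import geometric_mean

# Hypothetical per-model speedups: compiled throughput / eager throughput.
speedups = {"model_a": 1.2, "model_b": 1.6, "model_c": 2.4, "model_d": 0.9}

values = list(speedups.values())
geo = geometric_mean(values)       # n-th root of the product of the ratios
arith = sum(values) / len(values)  # arithmetic mean, for comparison

print(f"geometric mean:  {geo:.3f}x")
print(f"arithmetic mean: {arith:.3f}x")
```

The geometric mean is never larger than the arithmetic mean, so it gives the more conservative summary when a few models speed up far more than the rest.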

4.2 Hugging Face Model Acceleration

torch.compile inference testing of Hugging Face NLP models on AWS Graviton3 CPUs shows that approximately 70% of models achieve 1.2-2.0x acceleration, with some models exceeding 2.5x speedup [29]. The model types with the most significant acceleration are pure Transformer encoders (BERT class) and decoder-only generative models (GPT class).

4.3 Inference vs. Training Acceleration Differences

torch.compile typically achieves more significant acceleration in inference scenarios (1.5-2.5x) due to:

Inference only requires the forward graph, with no backward graph compilation overhead
Inference can stably use CUDA Graphs replay, eliminating kernel launch overhead
Inference can tolerate larger operator fusion granularity

Training scenario acceleration is typically 1.3-1.8x, mainly constrained by:

AOTAutograd needs to compile both forward and backward graphs simultaneously
min-cut partitioning and activation saving strategies add complexity
Hooks in distributed communication (DDP/FSDP) may interact with compiled regions [3][13]

4.4 Compilation Overhead

The core trade-off of JIT compilation is startup time vs. steady-state throughput:

Phase	PyTorch 2.0	PyTorch 2.5	PyTorch 2.12
First Compilation Cold Start	67 sec (full model LLaMA-7B)	9.6 sec (regional compilation)	3-5 sec (compilation cache)
Warm Recompilation	2-5 sec (depends on change scope)	0.5-2 sec	0.2-1 sec
Cache Hit Startup	N/A	~2 sec (disk load)	~0.5 sec
Triton Autotuning	Every compilation	Cached on first compilation	AOTInductor pre-compilation

vLLM’s compilation caching scheme allows sharing compilation artifacts (FX graphs, Triton kernels, etc.) across machines, significantly reducing startup time in auto-scaling scenarios [6].
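
The startup-versus-throughput trade-off can be estimated with simple arithmetic: compilation pays off once the time saved per step exceeds the one-off compile cost. The helper below is an illustrative sketch; break_even_steps, the 0.25 s eager step time, and the 2x speedup are assumptions, while 67 s and 9.6 s are the cold start figures quoted above.

```python
import math


def break_even_steps(compile_s, eager_step_s, speedup):
    """Steps needed before the time saved per step repays the compile cost."""
    if speedup <= 1.0:
        return math.inf  # compiled code is not faster; cost is never repaid
    saved_per_step = eager_step_s * (1.0 - 1.0 / speedup)
    return math.ceil(compile_s / saved_per_step)


# Assumed workload: 0.25 s per eager step, 2x steady-state speedup.
print(break_even_steps(67.0, 0.25, 2.0))  # full-model cold start: 536 steps
print(break_even_steps(9.6, 0.25, 2.0))   # regional compilation: 77 steps
```

For a long training run either figure is negligible, but for a short-lived inference replica the difference decides whether compilation is worthwhile at all.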

5. Dynamo Graph Capture and Tracing Modes

5.1 Graph Break Mechanism

Graph breaks are the most important design feature of torch.compile and its fundamental difference from traditional static graph compilers (such as XLA).

When Dynamo encounters an untraceable operation (such as calling print(), torch.save(), calling non-PyTorch C extensions, conditional branches caused by data-dependent control flow, etc.), it does not error out; instead:

Ends the currently traced FX graph at the break point and submits it to the backend
Executes the untraceable operation in eager mode
Continues tracing a new FX graph after the break point

The advantage of this design is graceful degradation rather than all-or-nothing: even if parts of the model code are uncompilable, the rest can still be accelerated. The cost is that each graph break hands control back to the Python interpreter for the eager fallback (which also forces a CPU-GPU synchronization when the break depends on tensor values) and splits the graph, so operations on either side of the break cannot be fused [2][22].

Debugging Tools:

torch._dynamo.explain(func)(*args): Returns number of graphs, number of breaks, and the reason for each break
fullgraph=True: Errors out on any graph break, used to identify and fix all break points
TORCH_LOGS=recompiles: Logs recompilation events
TORCH_TRACE / tlparse: Structured tracing of compilation stages [22][30]

5.2 Guard System and Recompilation

Each compiled FX graph has a set of guards, which are assumption conditions checked at runtime:

# Typical guard example
check_tensor(x, "x"):  # Checks tensor properties
  - type(x) == torch.Tensor
  - x.dtype == torch.float32
  - x.device == torch.device("cuda:0")
  - x.size(0) == 32  # Static shape assumption
  - x.stride() == (64, 1)

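If the guards of every cached compilation fail on a subsequent call, Dynamo compiles a new specialization for the current inputs and caches it alongside the existing ones, up to a configurable recompilation limit, after which the function falls back to eager execution. Excessive recompilation is one of the primary sources of torch.compile performance issues [3][22].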
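
The guard-and-recompile cycle can be modelled as a cache whose entries are keyed by the assumptions they were compiled under. The following toy model is a sketch only: Guard and GuardedCache are hypothetical classes, and Dynamo's real guards cover far more state than dtype and shape.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Guard:
    """Toy stand-in for a guard set: the input properties a compile assumed."""
    dtype: str
    shape: tuple

    def check(self, dtype, shape):
        return self.dtype == dtype and self.shape == shape


@dataclass
class GuardedCache:
    entries: list = field(default_factory=list)  # (Guard, compiled_fn) pairs
    compiles: int = 0

    def call(self, dtype, shape):
        for guard, fn in self.entries:
            if guard.check(dtype, shape):
                return fn()  # guards pass: reuse the compiled code
        # Guard miss: compile a new specialization and keep the old ones.
        self.compiles += 1
        label = f"kernel[{dtype},{shape}]"
        fn = lambda: label
        self.entries.append((Guard(dtype, shape), fn))
        return fn()


cache = GuardedCache()
cache.call("float32", (32, 64))  # first call: compile
cache.call("float32", (32, 64))  # guards pass: cache hit
cache.call("float32", (16, 64))  # shape guard fails: recompile
cache.call("float32", (32, 64))  # earlier entry is still usable
print(cache.compiles)  # 2
```

Each distinct combination of guarded properties costs one compilation, which is why inputs with many distinct static shapes lead to repeated recompiles.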

Dynamic Shape Strategies:

Mode	Setting	Behavior	Applicable Scenarios
Automatic (default)	`dynamic=None`	Static compilation on first call; automatically triggers dynamic recompilation upon detecting shape change	Training with stable batch sizes
Dynamic	`dynamic=True`	Generates dynamic kernels capable of handling varying shapes from the start, reducing recompilation	Inference with variable-length sequences, variable image sizes
backed_size_oblivious	(PyTorch 2.8+)	Avoids unnecessary recompilation caused by 0/1 specialization, but still has guards	Production deployment, compromise option
mark_dynamic	Manual marking	`torch._dynamo.mark_dynamic(tensor, 0)` forces a specific dimension to be dynamic	Precise control

5.3 Unbacked Dynamic Shapes

Unbacked dynamic shapes are a mode developed in collaboration between vLLM and the PyTorch compiler team, offering the strongest guard guarantee: it guarantees not to add guards on these symbols, while also not performing 0/1 specialization. The trade-off is potentially missing some optimization opportunities (e.g., when contiguity cannot be determined, defaulting to calling contiguous() which introduces a clone). vLLM uses the UNBACKED mode by default in the V1 architecture to ensure no recompilation occurs during service [6][18].

6. Backend Ecosystem: Inductor, Triton, CUDA Graphs

6.1 TorchInductor’s Default Backend Status

Since PyTorch 2.0, Inductor has been the default backend for torch.compile. It can be explicitly specified via torch.compile(model, backend="inductor"). Alternative backends include:

Backend	Description	Applicable Scenarios
`inductor` (default)	Full compilation pipeline: Dynamo → AOTAutograd → Inductor → Triton/C++	GPU/CPU inference and training
`eager`	Dynamo graph capture only, runs with PyTorch eager	Dynamo fault diagnosis
`aot_eager`	Dynamo + AOTAutograd, runs with eager	AOTAutograd fault diagnosis
`cudagraphs`	Captures CUDA Graphs only, no Inductor	CUDA Graphs specific debugging
`tensorrt`	NVIDIA TensorRT backend	Production inference (NVIDIA GPU)
`openvino`	Intel OpenVINO backend	Intel CPU/GPU inference
`xla`	XLA backend (PyTorch/XLA)	TPU training/inference

6.2 Triton: GPU Kernel Code Generation

The Triton language developed by OpenAI is key to the torch.compile GPU path. Compared to writing CUDA directly, Triton provides a higher-level abstraction, automatically handling memory coalescing, intra-SM scheduling, and tiled computation patterns [2][24].

Inductor to Triton lowering flow:

Inductor converts the FX graph into an Intermediate Representation (IR)
Performs fusion decisions on the IR (which operations to merge into a single Triton kernel)
Generates Triton source code for each fused “kernel group”
The Triton compiler lowers the Triton source sequentially through TTIR → TTGIR → LLVM IR → PTX/SASS
The resulting GPU binary is loaded and launched on the GPU

Autotuning: Triton supports declaring multiple candidate configurations for each kernel via the triton.autotune decorator (varying parameters such as block sizes, num_warps, and num_stages), benchmarking each candidate to select the fastest. vLLM disables autotuning by default in production to reduce first-time compilation time; users can enable tuning for specific static sizes via compile_sizes=[1,2,4,8] [6][24][25].

6.3 Helion: A New Layer Between PyTorch and Triton

Helion (under active development 2025-2026) is a project launched by the PyTorch compiler team, aiming to provide an experience where “writing PyTorch eager code ≈ writing Triton kernels.” Helion allows users to describe custom compute kernels using PyTorch native ops, which are then compiled into efficient Triton kernels via autotuning. As of June 2026, Helion has entered Beta, with a planned official release in October 2026 [1][19].

6.4 Inductor-TensorRT Backend

The TensorRT backend contributed by NVIDIA allows torch.compile to directly generate TensorRT optimized engines. In PyTorch 2.12, the NVIDIA AITune tool further implements automatic mixed-precision graph and CUDA Graphs configuration optimization for torch.compile + TensorRT [30].

6.5 AOTInductor: Pre-Compilation for Production Deployment

AOTInductor is the combination of torch.export + Inductor, implementing ahead-of-time compilation:

torch.export.export(model, args) captures the model as a stable FX graph (without guards and Python dependencies)
AOTInductor compiles the graph into a shared library (.so file)
At deployment, the shared library is directly loaded and executed, with no Python runtime overhead

AOTInductor is particularly valuable in inference scenarios, eliminating compilation latency, Python runtime overhead, and supporting ABI-stable interfaces (binary compatibility across PyTorch versions) [1][31].

7. Challenges and Limitations

7.1 Graph Breaks Remain a Core Pain Point

Although Dynamo’s design makes graph breaks a graceful degradation mechanism rather than a crash, excessive graph breaks remain the number one reported performance issue by users [2][3][22]. Common graph break causes:

Break Cause	Proportion (estimated)	Typical Scenario
Data-dependent control flow	~35%	`if x.sum() < 0: ...`
Non-PyTorch C extension calls	~20%	`torch.save()`, custom C++ op
Python I/O operations	~15%	`print()`, `logging.info()`
Dynamic shape specialization failure	~15%	Intermediate tensor shape varies with input
Other (torch.func transforms, etc.)	~15%	`vmap`/`grad` nesting incompatibility

GraphMend Research (2025-2026): Savini Kashmira et al. proposed GraphMend in arXiv 2509.16248, a high-level compiler technique for automatically eliminating FX graph breaks through code transformations. In tests on Hugging Face models, GraphMend removed the graph breaks it targets (reaching zero breaks in most tested models), lowered latency by up to 75%, and improved throughput by up to 8% [8].

7.2 Dynamic Shape Handling Still Imperfect

Dynamic shapes (variable-length sequences, variable batch sizes, different resolution images) are one of the most challenging aspects for torch.compile. Even though PyTorch 2.12 offers three dynamic shape modes, Edward Yang explicitly states:

“We can’t guarantee we can compile a model with dynamic shapes.” [3]

Main issues: Dynamo’s guard system is optimized for static shapes; dynamic shapes trigger unpredictable recompilations; Inductor’s Triton kernel generation also relies on shape information to select optimal block sizes and scheduling strategies [22][27].

7.3 Limited Distributed Training Support

torch.compile’s support for distributed training still lags behind JAX:

DDP: torch.compile supports DistributedDataParallel, but requires graph breaks at DDP bucket boundaries to trigger gradient synchronization, which reduces the compiler optimization scope [3][13].
FSDP: Supports FSDP2 via DTensor, but DTensor compilation under dynamic shapes is still imperfect (GitHub issue #159635) [3].
SPMD Compiler: torch.compile does not assume the program is SPMD by default, thus does not automatically remove unused communication operations. GSPMD-style auto-parallelism (AutoParallel) is still under development [3].
Compilation Consistency: In multi-node training, each node compiles independently; if compilation decisions are inconsistent, it may lead to NCCL timeouts [3].

7.4 Compilation Time and Cold Start

The core trade-off of JIT compilation—compilation time vs. execution speed—is particularly prominent in production deployments:

Large-scale training scenarios ($250k+ in cost) generally cannot accept JIT compilation overhead and require pre-compilation solutions [13].
In auto-scaling inference scenarios (such as vLLM dynamic scaling), cold compilation time directly impacts service startup latency. The vLLM team has listed this as its highest priority improvement item [6].
Caching solutions (disk cache, distributed cache) are the current primary mitigation, but Edward Yang points out that “caching is not an ideal long-term solution for large-scale training” [3].

7.5 Numerical Precision Differences

Compiled results are not guaranteed to be bit-level equivalent to eager mode:

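During FP16/BF16 fusion, Inductor does not insert the intermediate down/up conversion operations that eager mode performs between separate kernels, which may lead to precision differences (eager-like rounding can be restored via the Inductor config option emulate_precision_casts=True) [3]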
Triton kernel reduction order differs from cuBLAS, producing minor floating-point rounding differences
matmul backend switching (cuBLAS vs Triton vs CUTLASS) may cause numerical changes
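
The reduction-order effect is a property of floating-point arithmetic itself and can be reproduced without a GPU. The sketch below shows why compiled and eager outputs should be compared with a tolerance rather than bit for bit; outputs_match is a hypothetical helper and its tolerances are illustrative defaults.

```python
import math

# Floating-point addition is not associative: the grouping changes rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)                             # False
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-9))   # True


def outputs_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Element-wise tolerance comparison of two equally long sequences."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


print(outputs_match([left_to_right, 1.0], [right_to_left, 1.0]))  # True
```

For tensors, torch.allclose and torch.testing.assert_close serve the same purpose, with tolerances chosen according to the dtype in use.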

7.6 Pipeline Debugging Complexity

The three-layer compiler pipeline (Dynamo → AOTAutograd → Inductor) means failures can occur at any layer. The official step-by-step isolation approach is:

backend="eager" → Test Dynamo graph capture
backend="aot_eager" → Test AOTAutograd backward tracing
backend="inductor" → Test Inductor compilation

While this layered diagnosis is systematic, the barrier to entry remains high for users unfamiliar with compiler internals [13].

8. JAX XLA vs TensorFlow XLA vs torch.compile

8.1 Compiler Philosophy Comparison

Dimension	torch.compile	JAX (XLA)	TensorFlow (XLA)
Graph Capture	Bytecode-level dynamic capture (Dynamo)	Functional tracing (jax.jit)	Static graph definition (tf.function)
Python Freedom	High (graph break mechanism)	Low (requires functionally pure code)	Medium (tf.function subset)
Default Backend	Inductor → Triton/C++	XLA → HLO/LLVM	XLA → HLO/LLVM
Training Compiler	AOTAutograd (backward pre-generation)	XLA handles entire fwd+bwd automatically	XLA handles automatically
Dynamic Shapes	Gradually improving (three modes)	Limited (jax.jit specializes on input shapes and recompiles for new ones)	Weak support (recompilation)
JIT vs AOT	JIT (default) + AOTInductor	JIT (jax.jit)	Both available
Hardware Support	NVIDIA GPU, AMD GPU, CPU, (TPU via XLA)	GPU, TPU, CPU	NVIDIA GPU, TPU, CPU
OSS Kernel Language	Triton (programmable GPU kernels)	Pallas (similar to Triton)	No automatic kernel generation

8.2 torch.compile vs JAX XLA: Functional Differences

JAX was designed from day one to be “compilation-first”—functionally pure code, immutable tensors, no side effects—which allows the XLA compiler to obtain a complete computation graph, perform aggressive fusion and global optimization [23][32].

torch.compile was designed to be “compatible with eager”—it must handle Python’s mutable semantics, in-place operations, and side effects. This design choice preserves PyTorch’s flexibility and ease of use but limits the aggressive optimizations the compiler can make.

Key Differences:

SPMD Compilation: JAX natively has GSPMD (Generalized SPMD partitioning), automatically mapping a single program to multiple devices, while torch.compile does not assume SPMD by default, requiring manual configuration of DTensor and distributed strategies [3].
Data Types: JAX strictly distinguishes jnp.float32, etc., avoiding the torch.Tensor / Python float mixed inference problems present in PyTorch.
Control Flow: JAX requires control flow to be expressed through explicit structured primitives like lax.cond / lax.scan / lax.while_loop, while torch.compile allows arbitrary Python control flow (at the cost of graph breaks).

8.3 Inductor + Triton vs XLA + HLO

Inductor + Triton and XLA have fundamental differences in their compiler lowering paths:

XLA: Uses HLO (High-Level Optimizer) IR as a hardware-independent intermediate representation, optimizes the entire graph and then lowers it to LLVM/PTX. XLA performs aggressive automatic fusion and layout optimization, but user control over generated code is limited [32].
Inductor + Triton: Inductor performs segmented fusion of the graph and then generates Triton kernel code for each segment. Triton is programmable—users can directly write Triton kernels and integrate them with torch.compile (PyTorch 2.3+) [14][24]. This design provides finer-grained control and makes it easier to adapt to new hardware (AMD GPUs, Intel GPUs, etc.).

Although JAX XLA is more advanced in some dimensions of compiler capability, PyTorch retains a dominant position by virtue of its larger ecosystem:

Research Papers: Approximately 85% of deep learning papers used PyTorch in 2026 [33]
Hugging Face: The vast majority of models are provided in PyTorch format
Production Inference: Mainstream inference engines like vLLM and TensorRT-LLM are PyTorch-centric
Training Infrastructure: torchtitan is recommended by Edward Yang as the starting point for large-scale training [3]

TensorFlow’s market share continues to decline, accounting for approximately 10-15% of the research field in 2026, maintaining presence mainly in specific application scenarios running on TPUs [33].

9. Production Adoption and Future Directions

9.1 Current State of Production Adoption

As of June 2026, torch.compile has achieved significant levels of production adoption:

Field	Adoption Status	Representative Cases
LLM Inference	vLLM V1 enabled by default	Llama, Mistral, Qwen series
Video/Vision Inference	Widely adopted	Stable Diffusion, DINOv2, CLIP
Training	Growing adoption	torchtitan, FSDP2 integration
Enterprise Deployment	torch.export + AOTInductor	Model serving in finance, healthcare
Edge Deployment	ExecuTorch foundation	Mobile devices, embedded systems

9.2 vLLM Deep Integration Experience

vLLM is the deepest production user of torch.compile [6]. Key experiences include:

Compilation Cache Shared Across Machines: vLLM’s ~/.cache/vllm/torch_compile_cache/ is safe to share when all factors (configuration, PyTorch version, model forward code) are identical, allowing warm-up in auto-scaling clusters
Ensuring No Runtime Recompilation: vLLM guarantees all compilation is completed before serving requests, preventing latency spikes caused by requests triggering new compilation
Piecewise CUDA Graphs: Captures only token-wise computation segments between attention operations as CUDA Graphs; attention itself runs in eager mode
Unbacked Dynamic Shapes: Adopts UNBACKED mode by default for the strongest guard guarantee
Custom Compiler Passes: vLLM implements custom passes such as SiLU+quantization fusion, AllReduce+RMSNorm fusion, and sequence parallelism+async TP, achieving throughput improvements of 8-15% through these fusions [6]

9.3 Future Directions

According to the PyTorch compiler team and community roadmap, key development directions for torch.compile include:

Precompile: No longer relying on caching mechanisms, but moving compilation entirely ahead of deployment, generating binaries with no Python dependencies. Edward Yang explicitly states that “caching is not an ideal long-term solution for large-scale training” [3].
Helion Official Release: GA planned for after Beta in October 2026, providing a Triton kernel programming layer with native PyTorch interfaces [19].
GSPMD-Level Auto-Parallelism: The AutoParallel project aims to automatically determine sufficiently good sharding strategies (data parallelism, tensor parallelism, expert parallelism), similar to JAX’s GSPMD [3].
Distributed Compilation Consistency: Compile once, broadcast to all nodes, avoiding NCCL timeout issues from independent multi-node compilation [3].
Continuous Compilation Time Optimization: vLLM’s -O0 to -O3 CLI flag restructuring, allowing users to make explicit trade-offs between startup time and performance [6].
FP4 and Lower Precision Fusion: The community is already developing FP4 fusion passes (Attention+Quant FP4, SiLU-Mul+Quant FP4) [6].
Broader Hardware Support: Inductor/Triton adaptation on AMD GPUs (ROCm stack), Intel GPUs (XPU), Apple Silicon (MPS).

10. Summary and Outlook

torch.compile represents the critical evolution of PyTorch from eager mode execution to compilation-optimized execution. Its core innovation—TorchDynamo’s bytecode-level graph capture—solved the long-standing problem that TorchScript failed to address: “how to achieve compilation acceleration while preserving Python flexibility.” The three-layer compiler pipeline design (Dynamo → AOTAutograd → Inductor) addresses different compiler problems at each level, from Python-level graph capture to GPU kernel code generation, forming a complete ML compiler stack.

From its initial release in March 2023 to PyTorch 2.12 in June 2026, torch.compile has made significant progress in compilation cold start time (67 sec → 3-4 sec), dynamic shape support, training compilation support, CUDA Graphs integration, and production deployment paths (AOTInductor). vLLM enabling torch.compile by default in the V1 architecture is an important milestone of its production maturity.

However, torch.compile still faces substantial challenges: performance degradation caused by graph breaks, imperfect dynamic shape handling, gaps compared to JAX in distributed training, compilation time bottlenecks, and numerical precision uncertainty. Third-party research such as GraphMend and vLLM’s custom pass mechanisms indicate that solving these challenges requires collaborative innovation between the compiler team and downstream users.

From a broader competitive perspective, torch.compile occupies a unique position in design philosophy between JAX (purely functional, compilation-first) and traditional PyTorch (fully eager, no compiler)—it attempts to capture the advantages of both. This compromise achieves a practically viable balance between flexibility and performance, but still falls short of JAX XLA in the depth of optimizations the compiler can apply. With the advancement of projects like Helion, Precompile, and AutoParallel, this gap is gradually narrowing.

For teams planning to use torch.compile in production, the current best practices are: use torchtitan as the training base [3], use vLLM’s V1 architecture for inference deployment [6], select dynamic shape modes based on workload characteristics, and proactively identify and fix graph break points via torch._dynamo.explain.

References

[1] ezyang. “State of torch.compile for training (August 2025)”. ezyang’s blog, 2025-08-13. https://blog.ezyang.com/2025/08/state-of-torch-compile-august-2025/
[2] CompilerSutra. “Inside torch.compile: Dynamo → AOTAutograd → Inductor → Triton Explained”. https://www.compilersutra.com/docs/ml-compilers/inside-torch-compile/
[3] ezyang. “State of torch.compile for training (August 2025) — Full text”. https://blog.ezyang.com/2025/08/state-of-torch-compile-august-2025/
[4] PyTorch Team. “PyTorch 2.6 Release Blog”. PyTorch Blog, 2025-01-29. https://pytorch.org/blog/pytorch2-6/
[5] PyTorch GitHub Releases. “PyTorch 2.6.0 Release Notes”. https://github.com/pytorch/pytorch/releases
[6] Luka Govedič, Richard Zou, Addie Stevens, Kaichao You, Michael Goin, Saša Zelenović. “Introduction to torch.compile and How It Works with vLLM”. vLLM Blog, 2025-08-20. https://vllm.ai/blog/2025-08-20-torch-compile
[7] vLLM Documentation. “torch.compile integration”. https://docs.vllm.ai/en/latest/design/torch_compile/
[8] Savini Kashmira, Jayanaka Dantanarayana, Thamirawaran Sathiyalogeswaran, Yichao Yuan, Nishil Talati, Krisztian Flautner, Lingjia Tang. “GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2”. arXiv:2509.16248, 2025-2026. https://arxiv.org/abs/2509.16248
[9] guillesanbri.com. “PyTorch Compilation: From TorchScript to torch.compile”. https://guillesanbri.com/pytorch-compilation/
[10] PyTorch Wikipedia. “PyTorch — TorchScript historical context”. https://en.wikipedia.org/wiki/PyTorch
[11] Soumith Chintala. “PyTorch 2.0 Announcement”. LinkedIn, 2022-12. https://www.linkedin.com/posts/soumith_so-excited-to-introduce-pytorch-20-a-year-activity-7004492936667136000-h4fX
[12] Mark Saroufim. “Frequently Asked Questions — PyTorch torch.compiler FAQ (NumPy support)”. PyTorch Docs, 2025-2026. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_faq.html
[13] Mark Saroufim. “Frequently Asked Questions — PyTorch torch.compiler FAQ”. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_faq.html
[14] PyTorch 2.3 Release. “User-defined Triton kernels in torch.compile”. PyTorch Facebook, 2024-04. https://www.facebook.com/pytorch/posts/426108130046205/
[15] PyTorch 2.4 Release Blog. “AOTInductor freezing, Python 3.12 support”. https://pytorch.org/blog/pytorch2-4/
[16] Sean Kim. “PyTorch 2.5 Release: 7x Faster Compile Cold Start and FlexAttention”. https://blog.imseankim.com/pytorch-2-5-release-compile-mode-improvements-new-features/
[17] PyTorch Blog. “FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention”. https://pytorch.org/blog/flexattention/
[18] vLLM Docs. “Dynamic shapes and vllm guard dropping — backed vs unbacked”. https://docs.vllm.ai/en/latest/design/torch_compile/
[19] ezyang (and PyTorch Compiler Team). “Helion project status”. Referenced in [1] as “beta October 2026”.
[20] CompilerSutra. “Inside torch.compile — The One-Line Picture and Stage Details”. https://www.compilersutra.com/docs/ml-compilers/inside-torch-compile/
[21] depyf Documentation. “A Walk Through Example of torch.compile”. https://depyf.readthedocs.io/en/latest/walk_through.html
[22] PyTorch Documentation. “torch.compile Troubleshooting”. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_troubleshooting.html
[23] gdymind. “jax.jit, torch.compile & CUDA graph — A Comparison”. gdymind’s Blog, 2026-03-07. https://gdymind.com/2026/03/07/jax-jit-torch-compile-CUDA-graph/
[24] PyTorch Developer Mailing List. “Question regarding horizontal fusion”. dev-discuss.pytorch.org, 2025-12. https://dev-discuss.pytorch.org/t/question-regarding-horizontal-fusion/3275
[25] DeepWiki. “Kernel Selection and Autotuning — pytorch/pytorch”. https://deepwiki.com/pytorch/pytorch/2.5.3-kernel-selection-and-autotuning
[26] DeepWiki. “CUDA Graph Capture and Memory Pools — pytorch/pytorch”. https://deepwiki.com/pytorch/pytorch/3.2.2-cuda-graph-capture-and-memory-pools
[27] PyTorch GitHub Issue #121968. “[RFC] Use CUDA graphs by default on torch.compile”. https://github.com/pytorch/pytorch/issues/121968
[28] PyTorch Documentation (Android Git). “torch.compiler_cudagraph_trees.rst”. https://android.googlesource.com/platform/external/pytorch/
[29] PyTorch Blog. “Accelerated PyTorch inference with torch.compile on AWS Graviton”. https://pytorch.org/blog/accelerated-pytorch-inference/
[30] supercharleszhu. “torch-compile-tutorial — Structured trace export (TORCH_TRACE)”. GitHub. https://github.com/supercharleszhu/torch-compile-tutorial
[31] PyTorch Documentation. “AOTInductor: Ahead-Of-Time Compilation for Torch.Export-ed Models”. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_aot_inductor.html
[32] GeneralCompute Blog. “Compiler-Level Optimizations for Inference: TorchInductor, Triton, and XLA”. 2026-05-06. https://www.generalcompute.com/blog/compiler-level-optimizations-for-inference
[33] Tech Insider. “PyTorch vs TensorFlow 2026: 85% Research Share Gap”. 2026-05. https://tech-insider.org/pytorch-vs-tensorflow-2026/
[34] Spheron Network. “PyTorch vs TensorFlow in 2026: Which AI Framework Should You Use?”. 2026-04. https://www.spheron.network/blog/pytorch-vs-tensorflow/
[35] Spheron Network. “torch.compile and CUDA Graphs for LLM Inference on H200 and B200”. https://www.spheron.network/blog/torch-compile-cuda-graphs-llm-inference-pytorch-2-6/
