1. Overview

torch.compile is a Just-In-Time (JIT) compilation framework introduced in PyTorch 2.0 (released March 2023), marking PyTorch’s transition from pure eager-mode execution to compilation-optimized execution. Its design goal is to speed up model execution, typically by 1.5-2x, through automatic graph capture and kernel code generation, while preserving PyTorch’s Python programmability and debugging flexibility [1][2].

torch.compile is driven by a pipeline composed of three core components: TorchDynamo (Python bytecode-level graph capture frontend), AOTAutograd (backward graph pre-generation for training scenarios), and TorchInductor (the default optimization backend, generating Triton GPU kernels or C++/OpenMP CPU kernels). As of August 2025, a report by Edward Yang (ezyang, core member of the PyTorch compiler team) indicates that 1.5-2x acceleration is the typical performance observed in common scenarios, and torch.compile enables global-level optimizations such as automatic activation checkpointing and asynchronous tensor parallelism [1][3].

As of June 2026, torch.compile has iterated to PyTorch 2.12, supporting Python 3.13, torch.compiler.set_stance for fine-grained performance control, and more [4][5]. At the production level, vLLM has enabled torch.compile by default as a core inference engine component since the V1 architecture [6]; the vast majority of models in the Hugging Face and TIMM model suites achieve substantial acceleration when torch.compile is enabled [7]. Third-party research such as GraphMend further demonstrates that automatically eliminating graph breaks can reduce model inference latency by up to 75% [8].


2. Product Evolution: From TorchScript to the PT2 Compiler Stack

2.1 Predecessor: Lessons from TorchScript

PyTorch introduced TorchScript in version 1.0 (2018), attempting to capture Python models as static graphs for optimized execution via torch.jit.trace and torch.jit.script. However, TorchScript had fundamental limitations: trace could only capture the execution trajectory under a given input path and could not handle control flow; while script could handle control flow, it required user code to adhere to a strict Python subset (not supporting most dynamic Python features, a large number of third-party libraries, numpy interaction, etc.). This led to consistently limited production adoption of TorchScript, with the user community reporting it as “too fragile and difficult to debug” [9][10].

2.2 PyTorch 2.0 (March 2023)

PyTorch 2.0 marked the debut of torch.compile. Its core innovation was TorchDynamo: a Python-level JIT compiler based on the CPython frame evaluation API, capable of observing Python bytecode execution at runtime, automatically extracting tensor computation regions into FX graphs, and performing graceful “graph breaks” (fallback) for uncapturable code segments [1][2][11]. Compared to TorchScript, users only need to add the @torch.compile() decorator to a model/function to obtain acceleration, without modifying the model definition.
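A minimal sketch of that usage follows. The function name scaled_tanh is illustrative, and backend="eager" is an assumption made only to keep the example lightweight (Dynamo still captures the graph, but no kernel code generation is required); omitting the argument selects the default backend.

```python
import torch

def scaled_tanh(x: torch.Tensor) -> torch.Tensor:
    # Plain eager-mode reference implementation.
    return torch.tanh(x) * 0.5 + x

# The same function wrapped by torch.compile; writing @torch.compile(...) above
# the definition is equivalent. backend="eager" is chosen for this sketch only.
compiled_scaled_tanh = torch.compile(scaled_tanh, backend="eager")

x = torch.linspace(-2.0, 2.0, steps=8)
eager_out = scaled_tanh(x)
compiled_out = compiled_scaled_tanh(x)  # the first call triggers graph capture
print(torch.allclose(eager_out, compiled_out))
```

The model definition itself is untouched; only the call site changes.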

PyTorch’s officially published TorchBench benchmarks showed a geometric mean speedup of 1.8-2x across 80+ models for version 2.0 [1][7]. However, the initial 2.0 release had pain points such as long compilation cold start times and imperfect dynamic shape support.

2.3 PyTorch 2.1-2.3 (2023-2024)

  • 2.1 (October 2023): Introduced native support for NumPy programs: torch.compile can automatically understand and compile NumPy code, supporting NumPy execution on CPU and CUDA as well as gradient backpropagation [12].
  • 2.2 (January 2024): Improved dynamic shape support, enabled dynamic=None mode by default (automatically detects shape changes and attempts dynamic compilation); AOTAutograd min-cut partitioning optimization [13].
  • 2.3 (April 2024): Supported integration of user-defined Triton kernels with torch.compile, PrimTorch operator set normalization [14].

2.4 PyTorch 2.4-2.6 (2024-Early 2025)

  • 2.4 (July 2024): Supported torch.compile under Python 3.12; AOTInductor freezing optimization (MKLDNN weight serialization) [15].
  • 2.5 (October 2024): Regional compilation reduced compilation cold start from 67 seconds to 9.6 seconds (7x improvement); introduced FlexAttention (automatically compiles custom attention mask functions into FlashAttention-level Triton kernels via torch.compile) [16][17].
  • 2.6 (January 2025): Supported Python 3.13; introduced torch.compiler.set_stance performance control interface; AOTInductor supported FP16 x86 CPU; Inductor CUDA Graphs enabled by default [4][5].

2.5 PyTorch 2.7-2.12 (2025-2026)

  • 2.7: CUDA Graph Trees (multi-graph shared memory pool); Inductor supported more operator fusion patterns.
  • 2.8 (August 2025): Numerous vLLM-related upstream improvements; compiler cache sharding by model hash; backed_size_oblivious dynamic shape mode (reducing unnecessary recompilations caused by 0/1 specialization) [6][18].
  • 2.9-2.11 (Late 2025 - Early 2026): Further compilation cold start optimization; Helion project entered Beta (providing a Triton kernel programming layer with native PyTorch interfaces); torch.export stabilization [19].
  • 2.12 (Current stable release, June 2026): Ongoing Dynamo tracing and guard system improvements; smarter recompilation detection; deeper integration with vLLM and NVIDIA AITune.

2.6 Comparison of Key Features Across Versions

| Feature | PyTorch 2.0 | PyTorch 2.5 | PyTorch 2.8 | PyTorch 2.12 |
|---|---|---|---|---|
| Release Date | 2023-03 | 2024-10 | 2025-08 | 2026-06 |
| Graph Capture Engine | TorchDynamo | Dynamo + Regional Compilation | Dynamo + Improved ephemeral tracing | Dynamo + backed_size_oblivious mode |
| Default Backend | TorchInductor | Inductor + Triton | Inductor + CUDA Graphs Trees | Inductor + Helion Beta |
| Dynamic Shape Support | Experimental, manual | Default dynamic=None auto-detection | backed/unbacked dual mode | Three modes: backed/unbacked/backed_size_oblivious |
| Compilation Cold Start | ~67 sec (full model) | ~9.6 sec (regional compilation) | ~5 sec (cache sharding) | ~3-4 sec (warm cache) |
| Python Support | 3.8-3.11 | 3.9-3.12 | 3.10-3.13 | 3.10-3.14 |
| Training Compilation | AOTAutograd min-cut | AOTAutograd + Auto AC | compiled autograd | compiled autograd + AOTInductor training |
| Attention Mechanism | Standard SDPA | FlexAttention Beta | FlexAttention + chunked attention | FlexAttention + context parallelism |
| Production Adoption | Experimental | Hugging Face + TIMM | vLLM enabled by default | vLLM + TensorRT + AITune integration |

3. Technical Architecture

The overall pipeline of torch.compile follows a “layered compiler” design: progressively lowering from high-level Python code to hardware-level executable code, with each layer solving compiler problems at different granularities [20][21].

3.1 Pipeline Overview

Python Model Code
    │
    ▼
[TorchDynamo] ─── Python Bytecode-Level Graph Capture
    │               Captures tensor computation regions as FX graphs
    │               Handles graph breaks
    │               Installs guards (runtime assumption checks)
    ▼
[AOTAutograd] ─── Joint Forward + Backward Graph Processing
    │               Activated only during training
    │               min-cut partitioning to minimize saved tensors
    │               Generates autograd.Function wrapper
    ▼
[TorchInductor] ── Optimization and Lowering
    │               Operator fusion (pointwise + reduction)
    │               Layout optimization and memory planning
    │               matmul backend selection (cuBLAS/Triton/CUTLASS)
    │               CUDA Graphs automatic partitioning
    ▼
[Triton/C++/OpenMP] ── Kernel Code Generation
    │               GPU: Triton source → TTIR → TTGIR → LLVM IR → PTX/SASS
    │               CPU: C++ + OpenMP code generation
    ▼
[GPU/CPU Runtime] ── Execution

3.2 TorchDynamo: Python Bytecode-Level Graph Capture

TorchDynamo is the entry point of the entire pipeline and the most fundamental difference between torch.compile and TorchScript. It does not require users to provide static source code; instead, it observes Python execution at runtime by hooking CPython’s frame evaluation API (PEP 523), installing its own frame evaluation function in place of the default _PyEval_EvalFrameDefault [2][20].

Mechanism:

  1. When a function decorated with @torch.compile is called, Dynamo takes over its frame execution
  2. Dynamo inspects bytecode instructions one by one, extracting sequences involving PyTorch tensor operations into a torch.fx.Graph representation
  3. For untraceable operations (calling non-PyTorch C extensions, I/O operations, data-dependent control flow), Dynamo performs a “graph break” at the current point—submitting the traced graph to the backend for compilation, running the break point in eager mode, and then continuing to trace a new graph
  4. Each compiled graph comes with a guard (runtime assumption check) that verifies whether the assumptions made during compilation (input shape, dtype, device, global variables, etc.) still hold in subsequent calls. If a guard fails, Dynamo will recompile [2][22]
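The capture step can be observed directly by passing a custom backend, since a backend is just a callable that receives the FX graph Dynamo extracted. The sketch below is illustrative only: the names recording_backend and captured_graphs are made up for this example, and returning gm.forward simply runs the captured graph without any optimization.

```python
import torch
import torch.fx

captured_graphs = []  # FX GraphModules that Dynamo hands to the backend

def recording_backend(gm: torch.fx.GraphModule, example_inputs):
    # Record the captured graph, then return a callable that runs it as-is.
    captured_graphs.append(gm)
    return gm.forward

@torch.compile(backend=recording_backend)
def f(x):
    return torch.relu(x) + 1.0

out = f(torch.tensor([-1.0, 2.0]))

# List the tensor operations that ended up in the first captured graph.
op_nodes = [n.target for n in captured_graphs[0].graph.nodes if n.op == "call_function"]
print(len(captured_graphs), op_nodes)
```

A function with no untraceable operations, like this one, is expected to yield a single graph; each graph break would add another entry to captured_graphs.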

Key Design Advantages:

Dynamo’s bytecode-level tracing means it can accept arbitrary Python code (including control flow, exception handling, and side effects) without requiring the code to follow a restricted Python subset; constructs it cannot trace fall back to eager execution through graph breaks instead of being rejected. This is the core difference that sets it apart from TorchScript and JAX’s jax.jit (which requires functionally pure code) [2][23].

3.3 AOTAutograd: Pre-Generated Backward Graph

In training scenarios, compiling only the forward graph is insufficient. AOTAutograd is responsible for generating the corresponding backward graph segment from each forward graph segment captured by Dynamo [3][13].

Processing Flow:

  1. Dynamo splits the forward function into several segments, generating an FX graph for each
  2. AOTAutograd applies the autograd mechanism to each forward graph segment, deriving the corresponding backward graph
  3. The min-cut partitioning algorithm is applied to each forward-backward graph pair, determining which intermediate activations must be saved (cannot be recomputed) and which can be recomputed to save memory [3]
  4. Forward-backward pairs are packaged into autograd.Function modules
  5. When the user calls .backward(), the eager mode autograd engine invokes these compiled backward graphs as if calling an atomic op
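The flow above can be exercised without Inductor by selecting the aot_eager backend, which runs Dynamo and AOTAutograd and then executes the resulting graphs eagerly. The following is a small sketch under that assumption; the model, loss_fn, and tensor sizes are arbitrary illustrations. It checks that gradients produced through the compiled backward match plain eager autograd.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1)
)
x = torch.randn(16, 4)

def loss_fn(m, inp):
    return m(inp).pow(2).mean()

# Reference: gradients from ordinary eager autograd.
loss_fn(model, x).backward()
eager_grads = [p.grad.clone() for p in model.parameters()]
model.zero_grad()

# aot_eager: forward captured by Dynamo, backward traced ahead of time by
# AOTAutograd; .backward() then invokes the pre-generated backward graph.
compiled_loss = torch.compile(loss_fn, backend="aot_eager")
compiled_loss(model, x).backward()
compiled_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(
    torch.allclose(a, b, atol=1e-6) for a, b in zip(eager_grads, compiled_grads)
)
print(grads_match)
```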

Important Limitation: Because PyTorch eager autograd does not support incrementally streaming gradients from large backward nodes, gradient updates are deferred and applied all at once at the end of the compiled region. This can be resolved through compiled autograd, but that requires the entire backward process to be compilable [3][13].

3.4 TorchInductor: Optimization and Kernel Code Generation

Inductor is the default backend for torch.compile, receiving FX graphs from Dynamo/AOTAutograd and executing a series of optimization and lowering steps to ultimately generate executable kernel code [20][21].

Core Optimization Capabilities:

  • Pointwise + Reduction Fusion: Fuses consecutive element-wise operations and reduction operations into a single Triton kernel, eliminating intermediate memory reads/writes. For example, in matmul → add → relu, the add and relu can be fused into one pointwise kernel, or into the matmul epilogue when a Triton matmul template is selected, instead of launching as separate kernels [20].
  • Horizontal Fusion: Merges multiple independent pointwise or reduction operations (when shapes are compatible) for co-scheduling [24].
  • Matmul Backend Autotuning: In max-autotune mode, Inductor benchmarks candidate implementations (cuBLAS, Triton templates, and CUTLASS) for each matrix multiplication configuration and selects the fastest option [24][25]. vLLM testing shows that for matrix multiplications of shape 8x2048x3072, Triton templates are much faster than the default cuBLAS dispatch [6].
  • Layout Optimization: Analyzes data dependencies to select appropriate tensor memory layouts, reducing transpose and copy operations.
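To see why pointwise fusion matters, a rough back-of-the-envelope model of global-memory traffic helps. The sketch below is not how Inductor decides what to fuse; it only counts bytes for y = relu(a + b), assuming float32 elements, that every unfused kernel reads its inputs from and writes its output to global memory, and that caches are ignored.

```python
def pointwise_chain_traffic(num_elements: int, bytes_per_element: int = 4) -> dict:
    """Global-memory bytes moved for y = relu(a + b), unfused vs. fused."""
    n = num_elements * bytes_per_element
    # Unfused: add reads a and b and writes tmp; relu reads tmp and writes y.
    unfused = (2 * n + n) + (n + n)
    # Fused: a single kernel reads a and b and writes y; tmp stays in registers.
    fused = 2 * n + n
    return {"unfused_bytes": unfused, "fused_bytes": fused, "ratio": unfused / fused}

traffic = pointwise_chain_traffic(1 << 20)
print(traffic)
```

Under these assumptions the fused version moves 3/5 of the bytes, which is where the benefit comes from for memory-bound pointwise chains.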

GPU Path: By default, Inductor emits Triton source code for each fused kernel; the Triton compiler then lowers it through TTIR (Triton IR) → TTGIR (Triton GPU IR) → LLVM IR → PTX/SASS.

CPU Path: Inductor can generate C++ code and parallelize it via OpenMP. PyTorch 2.6 further introduced AOTInductor FP16 support for x86 CPUs [4].

3.5 CUDA Graphs Integration

CUDA Graphs is a low-level technology provided by NVIDIA that can record a series of GPU kernel launches (and their exact memory addresses) as a cudaGraph_t, then replay them with extremely low CPU overhead. However, CUDA Graphs has strict constraints: it must contain only CUDA operations, input tensors must have static memory addresses, and there can be no CPU-side computation [26][27].

Inductor has built-in automatic CUDA Graphs support:

  • Automatically partitions computation graphs into CUDA Graphs compatible and incompatible segments (e.g., CPU-side logic in attention operations that cannot be captured)
  • Automatically manages static input buffers
  • Supports CUDA Graph Trees (multiple graphs sharing a single memory pool, avoiding cross-graph memory fragmentation) [26][28]

vLLM uses Piecewise CUDA Graphs, capturing only the computation segments between attention operations (typically token-wise operations) as graphs, with the attention operations themselves running in eager mode. This approach retains attention flexibility while gaining the low-overhead benefits of CUDA Graphs [6].


4. Performance and Benchmarks

4.1 Official TorchBench Benchmarks

PyTorch’s officially maintained TorchBench benchmark suite, tested across 80+ models, shows a geometric mean speedup of 1.8-2x for torch.compile (compared to eager mode) [1][7]. The speedup distribution by model type is as follows:

| Model Category | Representative Models | Speedup | Primary Acceleration Reasons |
|---|---|---|---|
| CNN Vision Models | ResNet-50, EfficientNet | 1.3-1.8x | Convolution kernel fusion, vertical operator fusion |
| Transformer NLP | BERT, RoBERTa | 1.5-2.2x | attention + MLP fusion, matmul autotune |
| LLM (Generative) | GPT-2, LLaMA | 1.4-2.0x | KV cache + CUDA Graphs piecewise capture |
| Image Segmentation | Mask R-CNN | 1.2-1.5x | Reduced fusion under compound loss branches |
| Speech | Wav2Vec2, Whisper | 1.3-1.7x | Convolution + self-attention hybrid fusion |
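For reference, a geometric mean speedup is the exponential of the mean log-speedup, which keeps a single outlier model from dominating the summary the way an arithmetic mean would. The sketch below uses hypothetical per-model speedups (not measured data) purely to show the calculation.

```python
import math
from statistics import geometric_mean

# Hypothetical per-model speedups (eager time / compiled time); not measurements.
speedups = {"model_a": 1.3, "model_b": 1.8, "model_c": 2.2, "model_d": 1.5}

geo = geometric_mean(speedups.values())
# Equivalent definition: exp of the arithmetic mean of the log-speedups.
geo_manual = math.exp(sum(math.log(s) for s in speedups.values()) / len(speedups))
arith = sum(speedups.values()) / len(speedups)

print(round(geo, 3), round(arith, 3))
```

The geometric mean never exceeds the arithmetic mean, so it is the more conservative of the two summaries.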

4.2 Hugging Face Model Acceleration

torch.compile inference testing of Hugging Face NLP models on AWS Graviton3 CPUs shows that approximately 70% of models achieve 1.2-2.0x acceleration, with some models exceeding 2.5x speedup [29]. The model types with the most significant acceleration are pure Transformer encoders (BERT class) and decoder-only generative models (GPT class).

4.3 Inference vs. Training Acceleration Differences

torch.compile typically achieves more significant acceleration in inference scenarios (1.5-2.5x) due to:

  • Inference only requires the forward graph, with no backward graph compilation overhead
  • Inference can stably use CUDA Graphs replay, eliminating kernel launch overhead
  • Inference can tolerate larger operator fusion granularity

Training scenario acceleration is typically 1.3-1.8x, mainly constrained by:

  • AOTAutograd needs to compile both forward and backward graphs simultaneously
  • min-cut partitioning and activation saving strategies add complexity
  • Hooks in distributed communication (DDP/FSDP) may interact with compiled regions [3][13]

4.4 Compilation Overhead

The core trade-off of JIT compilation is startup time vs. steady-state throughput:

| Phase | PyTorch 2.0 | PyTorch 2.5 | PyTorch 2.12 |
|---|---|---|---|
| First Compilation Cold Start | 67 sec (full model LLaMA-7B) | 9.6 sec (regional compilation) | 3-5 sec (compilation cache) |
| Warm Recompilation | 2-5 sec (depends on change scope) | 0.5-2 sec | 0.2-1 sec |
| Cache Hit Startup | N/A | ~2 sec (disk load) | ~0.5 sec |
| Triton Autotuning | Every compilation | Cached on first compilation | AOTInductor pre-compilation |

vLLM’s compilation caching scheme allows sharing compilation artifacts (FX graphs, Triton kernels, etc.) across machines, significantly reducing startup time in auto-scaling scenarios [6].
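The startup-versus-throughput trade-off can be made concrete with a break-even estimate: compilation pays off once the accumulated per-call savings exceed the one-time compile cost. In the sketch below, the helper break_even_calls is hypothetical, the cold-start figures are the ones quoted in this section, and the 50 ms eager latency and 1.5x speedup are illustrative assumptions.

```python
import math

def break_even_calls(compile_seconds: float, eager_seconds: float, speedup: float) -> int:
    """Number of calls after which one-time compile cost is amortized."""
    if speedup <= 1.0:
        raise ValueError("no per-call saving, so compilation never pays off")
    saved_per_call = eager_seconds - eager_seconds / speedup
    return math.ceil(compile_seconds / saved_per_call)

# 67 s and 9.6 s are the cold-start figures quoted above; the per-call latency
# and speedup are assumptions chosen only for illustration.
for cold_start in (67.0, 9.6):
    print(cold_start, break_even_calls(cold_start, eager_seconds=0.050, speedup=1.5))
```

Long-running training or serving jobs clear this threshold quickly; short scripts and auto-scaled replicas that handle few requests may not.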


5. Dynamo Graph Capture and Tracing Modes

5.1 Graph Break Mechanism

Graph breaks are the most important design feature of torch.compile and its fundamental difference from traditional static graph compilers (such as XLA).

When Dynamo encounters an untraceable operation (such as calling print(), torch.save(), calling non-PyTorch C extensions, conditional branches caused by data-dependent control flow, etc.), it does not error out; instead:

  1. Ends the currently traced FX graph at the break point and submits it to the backend
  2. Executes the untraceable operation in eager mode
  3. Continues tracing a new FX graph after the break point

The advantage of this design is graceful degradation rather than all-or-nothing: even if parts of the model code are uncompilable, the rest can still be accelerated. The cost is that each graph break hands control back to the Python interpreter for the eager fallback, which adds per-call overhead, and splits the graph so that operations on either side of the break can no longer be fused [2][22].

Debugging Tools:

  • torch._dynamo.explain(func)(*args): Returns number of graphs, number of breaks, and the reason for each break
  • fullgraph=True: Errors out on any graph break, used to identify and fix all break points
  • TORCH_LOGS=recompiles: Logs recompilation events
  • TORCH_TRACE / tlparse: Structured tracing of compilation stages [22][30]

5.2 Guard System and Recompilation

Each compiled FX graph has a set of guards, which are assumption conditions checked at runtime:

# Typical guard example
check_tensor(x, "x"):  # Checks tensor properties
  - type(x) == torch.Tensor
  - x.dtype == torch.float32
  - x.device == torch.device("cuda:0")
  - x.size(0) == 32  # Static shape assumption
  - x.stride() == (64, 1)

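If any guard fails on a subsequent call, the compiled graph cannot be reused for that call, and Dynamo recompiles the frame for the new conditions, keeping earlier compiled variants cached up to a configurable limit. Excessive recompilation is one of the primary sources of torch.compile performance issues [3][22].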
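Recompilation can be made visible by counting how often the backend is invoked. The sketch below uses a made-up counting_backend that returns the captured graph unchanged, and dynamic=False so that shapes and dtypes are treated as static assumptions; a dtype change then fails a guard and triggers a second compilation.

```python
import torch
import torch._dynamo

compile_calls = 0

def counting_backend(gm, example_inputs):
    # Invoked once per (re)compilation; runs the captured graph unmodified.
    global compile_calls
    compile_calls += 1
    return gm.forward

torch._dynamo.reset()

@torch.compile(backend=counting_backend, dynamic=False)
def double(x):
    return x * 2

double(torch.ones(4))                        # first call: compile
double(torch.zeros(4))                       # same shape and dtype: guards pass
after_reuse = compile_calls
double(torch.ones(4, dtype=torch.float64))   # dtype guard fails: recompile
after_dtype_change = compile_calls
print(after_reuse, after_dtype_change)
```

Setting TORCH_LOGS=recompiles while running code like this logs which guard failed.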

Dynamic Shape Strategies:

| Mode | Setting | Behavior | Applicable Scenarios |
|---|---|---|---|
| Static (default) | dynamic=None first call static | Static compilation on first call; automatically triggers dynamic recompilation upon detecting shape change | Training with stable batch sizes |
| Dynamic=True | dynamic=True | Generates dynamic kernels capable of handling varying shapes from the start, avoiding recompilation | Inference with variable-length sequences, variable image sizes |
| backed_size_oblivious | (PyTorch 2.8+) | Avoids unnecessary recompilation caused by 0/1 specialization, but still has guards | Production deployment, compromise option |
| mark_dynamic | Manual marking | torch._dynamo.mark_dynamic(tensor, 0) forces a specific dimension to be dynamic | Precise control |

5.3 Unbacked Dynamic Shapes

Unbacked dynamic shapes are a mode developed in collaboration between vLLM and the PyTorch compiler team, offering the strongest guard guarantee: it guarantees not to add guards on these symbols, while also not performing 0/1 specialization. The trade-off is potentially missing some optimization opportunities (e.g., when contiguity cannot be determined, defaulting to calling contiguous() which introduces a clone). vLLM uses the UNBACKED mode by default in the V1 architecture to ensure no recompilation occurs during service [6][18].


6. Backend Ecosystem: Inductor, Triton, CUDA Graphs

6.1 TorchInductor’s Default Backend Status

Since PyTorch 2.0, Inductor has been the default backend for torch.compile. It can be explicitly specified via torch.compile(model, backend="inductor"). Alternative backends include:

| Backend | Description | Applicable Scenarios |
|---|---|---|
| inductor (default) | Full compilation pipeline: Dynamo → AOTAutograd → Inductor → Triton/C++ | GPU/CPU inference and training |
| eager | Dynamo graph capture only, runs with PyTorch eager | Dynamo fault diagnosis |
| aot_eager | Dynamo + AOTAutograd, runs with eager | AOTAutograd fault diagnosis |
| cudagraphs | Captures CUDA Graphs only, no Inductor | CUDA Graphs specific debugging |
| tensorrt | NVIDIA TensorRT backend | Production inference (NVIDIA GPU) |
| openvino | Intel OpenVINO backend | Intel CPU/GPU inference |
| xla | XLA backend (PyTorch/XLA) | TPU training/inference |

6.2 Triton: GPU Kernel Code Generation

The Triton language developed by OpenAI is key to the torch.compile GPU path. Compared to writing CUDA directly, Triton provides a higher-level abstraction, automatically handling memory coalescing, intra-SM scheduling, and tiled computation patterns [2][24].

Inductor to Triton lowering flow:

  1. Inductor converts the FX graph into an Intermediate Representation (IR)
  2. Performs fusion decisions on the IR (which operations to merge into a single Triton kernel)
  3. Generates Triton source code for each fused “kernel group”
  4. The Triton compiler lowers each kernel sequentially through TTIR → TTGIR → LLVM IR → PTX/SASS
  5. The resulting GPU binary is loaded and executed on the GPU

Autotuning: Triton’s @triton.autotune decorator lets each kernel declare multiple candidate configurations (such as block sizes, number of warps, and number of pipeline stages), which are benchmarked to select the fastest option. vLLM disables autotuning by default in production to reduce first-time compilation time; users can enable tuning for specific static sizes via compile_sizes=[1,2,4,8] [6][24][25].

6.3 Helion: A New Layer Between PyTorch and Triton

Helion (under active development 2025-2026) is a project launched by the PyTorch compiler team, aiming to provide an experience where “writing PyTorch eager code ≈ writing Triton kernels.” Helion allows users to describe custom compute kernels using PyTorch native ops, which are then compiled into efficient Triton kernels via autotuning. As of June 2026, Helion has entered Beta, with a planned official release in October 2026 [1][19].

6.4 Inductor-TensorRT Backend

The TensorRT backend contributed by NVIDIA allows torch.compile to directly generate TensorRT optimized engines. In PyTorch 2.12, the NVIDIA AITune tool further implements automatic mixed-precision graph and CUDA Graphs configuration optimization for torch.compile + TensorRT [30].

6.5 AOTInductor: Pre-Compilation for Production Deployment

AOTInductor is the combination of torch.export + Inductor, implementing ahead-of-time compilation:

  1. torch.export.export(model, args) captures the model as a stable FX graph (without guards and Python dependencies)
  2. AOTInductor compiles the graph into a shared library (.so file)
  3. At deployment, the shared library is directly loaded and executed, with no Python runtime overhead

AOTInductor is particularly valuable in inference scenarios, eliminating compilation latency, Python runtime overhead, and supporting ABI-stable interfaces (binary compatibility across PyTorch versions) [1][31].


7. Challenges and Limitations

7.1 Graph Breaks Remain a Core Pain Point

Although Dynamo’s design makes graph breaks a graceful degradation mechanism rather than a crash, excessive graph breaks remain the number one reported performance issue by users [2][3][22]. Common graph break causes:

| Break Cause | Proportion (estimated) | Typical Scenario |
|---|---|---|
| Data-dependent control flow | ~35% | if x.sum() < 0: ... |
| Non-PyTorch C extension calls | ~20% | torch.save(), custom C++ op |
| Python I/O operations | ~15% | print(), logging.info() |
| Dynamic shape specialization failure | ~15% | Intermediate tensor shape varies with input |
| Other (torch.func transforms, etc.) | ~15% | vmap/grad nesting incompatibility |

GraphMend Research (2025-2026): Savini Kashmira et al. proposed GraphMend in arXiv 2509.16248, a high-level compiler technique for automatically eliminating FX graph breaks through code transformations. In tests on Hugging Face models, GraphMend reduced graph breaks to zero, lowered latency by up to 75%, and improved throughput by up to 8% [8].
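A manual version of this kind of rewrite, for the data-dependent control flow case listed above, is to express the branch as a tensor operation. The sketch below is a hand-written illustration and not GraphMend’s algorithm; the function names are made up, and backend="eager" with fullgraph=True is used only to test whether each variant can be captured as one graph.

```python
import torch
import torch._dynamo

def branchy(x):
    # Python `if` on a tensor value: data-dependent control flow.
    if x.sum() < 0:
        return -x
    return x

def branch_free(x):
    # Same result as a tensor op, so the condition stays inside the graph.
    # Note that both branches are computed and one is selected.
    return torch.where(x.sum() < 0, -x, x)

def compiles_as_one_graph(fn, example):
    torch._dynamo.reset()
    try:
        torch.compile(fn, backend="eager", fullgraph=True)(example)
        return True
    except Exception:
        return False

x = torch.tensor([1.0, -3.0, 0.5])
branchy_ok = compiles_as_one_graph(branchy, x)
branch_free_ok = compiles_as_one_graph(branch_free, x)
print(branchy_ok, branch_free_ok)
print(torch.equal(branch_free(x), branchy(x)))
```

The trade-off is extra computation for the unused branch, which is usually acceptable for cheap pointwise branches and less so for expensive ones.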

7.2 Dynamic Shape Handling Still Imperfect

Dynamic shapes (variable-length sequences, variable batch sizes, different resolution images) are one of the most challenging aspects for torch.compile. Even though PyTorch 2.12 offers three dynamic shape modes, Edward Yang explicitly states:

“We can’t guarantee we can compile a model with dynamic shapes.” [3]

Main issues: Dynamo’s guard system is optimized for static shapes; dynamic shapes trigger unpredictable recompilations; Inductor’s Triton kernel generation also relies on shape information to select optimal block sizes and scheduling strategies [22][27].

7.3 Limited Distributed Training Support

torch.compile’s support for distributed training still lags behind JAX:

  • DDP: torch.compile supports DistributedDataParallel, but requires graph breaks at DDP bucket boundaries to trigger gradient synchronization, which reduces the compiler optimization scope [3][13].
  • FSDP: Supports FSDP2 via DTensor, but DTensor compilation under dynamic shapes is still imperfect (GitHub issue #159635) [3].
  • SPMD Compiler: torch.compile does not assume the program is SPMD by default, thus does not automatically remove unused communication operations. GSPMD-style auto-parallelism (AutoParallel) is still under development [3].
  • Compilation Consistency: In multi-node training, each node compiles independently; if compilation decisions are inconsistent, it may lead to NCCL timeouts [3].

7.4 Compilation Time and Cold Start

The core trade-off of JIT compilation—compilation time vs. execution speed—is particularly prominent in production deployments:

  • Large-scale training scenarios ($250k+ cost) generally cannot accept JIT compilation overhead and require pre-compilation solutions [13].
  • In auto-scaling inference scenarios (such as vLLM dynamic scaling), cold compilation time directly impacts service startup latency. The vLLM team has listed this as its highest priority improvement item [6].
  • Caching solutions (disk cache, distributed cache) are the current primary mitigation, but Edward Yang points out that “caching is not an ideal long-term solution for large-scale training” [3].

7.5 Numerical Precision Differences

Compiled results are not guaranteed to be bit-level equivalent to eager mode:

  • During FP16/BF16 fusion, Inductor does not insert the redundant down/up conversion operations that eager mode performs between ops, which may lead to precision differences (eager-like rounding can be emulated via torch._inductor.config.emulate_precision_casts = True) [3]
  • Triton kernel reduction order differs from cuBLAS, producing minor floating-point rounding differences
  • matmul backend switching (cuBLAS vs Triton vs CUTLASS) may cause numerical changes
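The reduction-order effect comes from floating-point addition not being associative, which can be shown without any GPU. The values below are chosen to exaggerate the effect in float64; real kernels see far smaller differences, typically in the last few bits.

```python
values = [1e16, 1.0, -1e16]

# Left to right: 1.0 is absorbed when added to 1e16, then the large terms cancel.
left_to_right = (values[0] + values[1]) + values[2]
# Reordered: the large terms cancel first, so 1.0 survives.
reordered = (values[0] + values[2]) + values[1]

print(left_to_right, reordered)  # 0.0 1.0
```

This is why comparisons between compiled and eager outputs should use a tolerance (for example torch.allclose) rather than exact equality.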

7.6 Pipeline Debugging Complexity

The three-layer compiler pipeline (Dynamo → AOTAutograd → Inductor) means failures can occur at any layer. The official step-by-step isolation approach is:

  1. backend="eager" → Test Dynamo graph capture
  2. backend="aot_eager" → Test AOTAutograd backward tracing
  3. backend="inductor" → Test Inductor compilation

While this layered diagnosis is systematic, the barrier to entry remains high for users unfamiliar with compiler internals [13].
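The isolation steps can be scripted as a loop over backends that compares each result with an eager reference. The sketch below covers only the first two stages so that it runs without a compiler toolchain; adding "inductor" to the tuple would test the third stage. The function model_fn and the tolerance are arbitrary choices for illustration.

```python
import torch
import torch._dynamo

def model_fn(x, w):
    return torch.nn.functional.gelu(x @ w).sum(dim=-1)

torch.manual_seed(0)
x, w = torch.randn(8, 16), torch.randn(16, 4)
reference = model_fn(x, w)

results = {}
for backend in ("eager", "aot_eager"):
    torch._dynamo.reset()  # start each stage from a clean compilation cache
    try:
        out = torch.compile(model_fn, backend=backend)(x, w)
        results[backend] = bool(torch.allclose(out, reference, atol=1e-5))
    except Exception as exc:
        # The first backend that fails points at the layer to investigate.
        results[backend] = f"failed: {type(exc).__name__}"

print(results)
```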


8. JAX XLA vs TensorFlow XLA vs torch.compile

8.1 Compiler Philosophy Comparison

| Dimension | torch.compile | JAX (XLA) | TensorFlow (XLA) |
|---|---|---|---|
| Graph Capture | Bytecode-level dynamic capture (Dynamo) | Functional tracing (jax.jit) | Static graph definition (tf.function) |
| Python Freedom | High (graph break mechanism) | Low (requires functionally pure code) | Medium (tf.function subset) |
| Default Backend | Inductor → Triton/C++ | XLA → HLO/LLVM | XLA → HLO/LLVM |
| Training Compiler | AOTAutograd (backward pre-generation) | XLA handles entire fwd+bwd automatically | XLA handles automatically |
| Dynamic Shapes | Gradually improving (three modes) | Inherently supported (XLA dynamic) | Weak support (recompilation) |
| JIT vs AOT | JIT (default) + AOTInductor | JIT (jax.jit) | Both available |
| Hardware Support | NVIDIA GPU, AMD GPU, CPU, (TPU via XLA) | GPU, TPU, CPU | NVIDIA GPU, TPU, CPU |
| OSS Kernel Language | Triton (programmable GPU kernels) | Pallas (similar to Triton) | No automatic kernel generation |

8.2 torch.compile vs JAX XLA: Functional Differences

JAX was designed from day one to be “compilation-first”—functionally pure code, immutable tensors, no side effects—which allows the XLA compiler to obtain a complete computation graph, perform aggressive fusion and global optimization [23][32].

torch.compile was designed to be “compatible with eager”—it must handle Python’s mutable semantics, in-place operations, and side effects. This design choice preserves PyTorch’s flexibility and ease of use but limits the aggressive optimizations the compiler can make.

Key Differences:

  • SPMD Compilation: JAX natively has GSPMD (Generalized SPMD partitioning), automatically mapping a single program to multiple devices, while torch.compile does not assume SPMD by default, requiring manual configuration of DTensor and distributed strategies [3].
  • Data Types: JAX strictly distinguishes jnp.float32, etc., avoiding the torch.Tensor / Python float mixed inference problems present in PyTorch.
  • Control Flow: JAX requires control flow to be expressed through explicit structured primitives like lax.cond / lax.scan / lax.while_loop, while torch.compile allows arbitrary Python control flow (at the cost of graph breaks).

8.3 Inductor + Triton vs XLA + HLO

Inductor + Triton and XLA have fundamental differences in their compiler lowering paths:

  • XLA: Uses HLO (High-Level Optimizer) IR as a hardware-independent intermediate representation, optimizes the entire graph and then lowers it to LLVM/PTX. XLA performs aggressive automatic fusion and layout optimization, but user control over generated code is limited [32].
  • Inductor + Triton: Inductor performs segmented fusion of the graph and then generates Triton kernel code for each segment. Triton is programmable—users can directly write Triton kernels and integrate them with torch.compile (PyTorch 2.3+) [14][24]. This design provides finer-grained control and makes it easier to adapt to new hardware (AMD GPUs, Intel GPUs, etc.).

8.4 Market Share and Ecosystem

Although JAX with XLA is more advanced in some dimensions of compiler capability, PyTorch dominates in adoption by virtue of its larger ecosystem:

  • Research Papers: Approximately 85% of deep learning papers used PyTorch in 2026 [33]
  • Hugging Face: The vast majority of models are provided in PyTorch format
  • Production Inference: Mainstream inference engines like vLLM and TensorRT-LLM are PyTorch-centric
  • Training Infrastructure: torchtitan is recommended by Edward Yang as the starting point for large-scale training [3]

TensorFlow’s market share continues to decline, accounting for approximately 10-15% of the research field in 2026, maintaining presence mainly in specific application scenarios running on TPUs [33].


9. Production Adoption and Future Directions

9.1 Current State of Production Adoption

As of June 2026, torch.compile has achieved significant levels of production adoption:

| Field | Adoption Status | Representative Cases |
|---|---|---|
| LLM Inference | vLLM V1 enabled by default | Llama, Mistral, Qwen series |
| Video/Vision Inference | Widely adopted | Stable Diffusion, DINOv2, CLIP |
| Training | Growing adoption | torchtitan, FSDP2 integration |
| Enterprise Deployment | torch.export + AOTInductor | Model serving in finance, healthcare |
| Edge Deployment | ExecuTorch foundation | Mobile devices, embedded systems |

9.2 vLLM Deep Integration Experience

vLLM is the deepest production user of torch.compile [6]. Key experiences include:

  1. Compilation Cache Shared Across Machines: vLLM’s ~/.cache/vllm/torch_compile_cache/ is safe to share when all factors (configuration, PyTorch version, model forward code) are identical, allowing warm-up in auto-scaling clusters
  2. Ensuring No Runtime Recompilation: vLLM guarantees all compilation is completed before serving requests, preventing latency spikes caused by requests triggering new compilation
  3. Piecewise CUDA Graphs: Captures only token-wise computation segments between attention operations as CUDA Graphs; attention itself runs in eager mode
  4. Unbacked Dynamic Shapes: Adopts UNBACKED mode by default for the strongest guard guarantee
  5. Custom Compiler Passes: vLLM implements custom passes such as SiLU+quantization fusion, AllReduce+RMSNorm fusion, and sequence parallelism+async TP, achieving throughput improvements of 8-15% through targeted fusion [6]
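The cache-sharing condition in item 1 amounts to keying compiled artifacts on every factor that influences compilation. The sketch below is a hypothetical illustration of that idea, not vLLM’s actual hashing scheme; the function cache_key, its inputs, and the 16-character truncation are all assumptions.

```python
import hashlib
import json

def cache_key(config: dict, torch_version: str, forward_source: str) -> str:
    """Derive a cache directory name from the factors that affect compilation."""
    payload = json.dumps(
        {"config": config, "torch": torch_version, "forward": forward_source},
        sort_keys=True,  # key order in the config must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

source = "def forward(x): return x * 2"
base = cache_key({"dtype": "bf16", "tp": 2}, "2.8.0", source)
same = cache_key({"tp": 2, "dtype": "bf16"}, "2.8.0", source)
bumped = cache_key({"dtype": "bf16", "tp": 2}, "2.9.0", source)

print(base == same, base == bumped)  # True False
```

Two machines that compute the same key can share artifacts; changing any factor yields a different key and therefore a fresh compilation.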

9.3 Future Directions

According to the PyTorch compiler team and community roadmap, key development directions for torch.compile include:

  • Precompile: No longer relying on caching mechanisms, but moving compilation entirely ahead of deployment, generating binaries with no Python dependencies. Edward Yang explicitly states that “caching is not an ideal long-term solution for large-scale training” [3].
  • Helion Official Release: General availability (GA) is planned to follow the October 2026 beta; Helion provides a Triton kernel programming layer with native PyTorch interfaces [19].
  • GSPMD-Level Auto-Parallelism: The AutoParallel project aims to automatically determine sufficiently good sharding strategies (data parallelism, tensor parallelism, expert parallelism), similar to JAX’s GSPMD [3].
  • Distributed Compilation Consistency: Compile once, broadcast to all nodes, avoiding NCCL timeout issues from independent multi-node compilation [3].
  • Continuous Compilation Time Optimization: vLLM is restructuring its CLI optimization flags into levels -O0 to -O3, letting users make explicit trade-offs between startup time and performance [6].
  • FP4 and Lower Precision Fusion: The community is already developing FP4 fusion passes (Attention+Quant FP4, SiLU-Mul+Quant FP4) [6].
  • Broader Hardware Support: Inductor/Triton adaptation on AMD GPUs (ROCm stack), Intel GPUs (XPU), Apple Silicon (MPS).

10. Summary and Outlook

torch.compile represents the critical evolution of PyTorch from eager mode execution to compilation-optimized execution. Its core innovation—TorchDynamo’s bytecode-level graph capture—solved the long-standing problem that TorchScript failed to address: “how to achieve compilation acceleration while preserving Python flexibility.” The three-layer compiler pipeline design (Dynamo → AOTAutograd → Inductor) addresses different compiler problems at each level, from Python-level graph capture to GPU kernel code generation, forming a complete ML compiler stack.

From its initial release in March 2023 to PyTorch 2.12 in June 2026, torch.compile has made significant progress in compilation cold start time (67 sec → 3-4 sec), dynamic shape support, training compilation support, CUDA Graphs integration, and production deployment paths (AOTInductor). vLLM enabling torch.compile by default in the V1 architecture marks an important milestone in its production maturity.

However, torch.compile still faces substantial challenges: performance degradation caused by graph breaks, imperfect dynamic shape handling, gaps compared to JAX in distributed training, compilation time bottlenecks, and numerical precision uncertainty. Third-party research such as GraphMend and vLLM’s custom pass mechanisms indicate that solving these challenges requires collaborative innovation between the compiler team and downstream users.

From a broader competitive perspective, torch.compile occupies a unique position in design philosophy between JAX (purely functional, compilation-first) and traditional PyTorch (fully eager, no compiler): it attempts to capture the advantages of both. This compromise achieves a practically viable balance between flexibility and performance, but still falls short of JAX/XLA in the depth of optimizations the compiler can apply. As projects like Helion, Precompile, and AutoParallel advance, this gap is gradually narrowing.

For teams planning to use torch.compile in production, the current best practices are: use torchtitan as the training base [3], use vLLM’s V1 architecture for inference deployment [6], select dynamic shape modes based on workload characteristics, and proactively identify and fix graph break points via torch._dynamo.explain.


References

  1. ezyang. “State of torch.compile for training (August 2025)”. ezyang’s blog, 2025-08-13. https://blog.ezyang.com/2025/08/state-of-torch-compile-august-2025/
  2. CompilerSutra. “Inside torch.compile: Dynamo → AOTAutograd → Inductor → Triton Explained”. https://www.compilersutra.com/docs/ml-compilers/inside-torch-compile/
  3. ezyang. “State of torch.compile for training (August 2025) — Full text”. https://blog.ezyang.com/2025/08/state-of-torch-compile-august-2025/
  4. PyTorch Team. “PyTorch 2.6 Release Blog”. PyTorch Blog, 2025-01-29. https://pytorch.org/blog/pytorch2-6/
  5. PyTorch GitHub Releases. “PyTorch 2.6.0 Release Notes”. https://github.com/pytorch/pytorch/releases
  6. Luka Govedič, Richard Zou, Addie Stevens, Kaichao You, Michael Goin, Saša Zelenović. “Introduction to torch.compile and How It Works with vLLM”. vLLM Blog, 2025-08-20. https://vllm.ai/blog/2025-08-20-torch-compile
  7. vLLM Documentation. “torch.compile integration”. https://docs.vllm.ai/en/latest/design/torch_compile/
  8. Savini Kashmira, Jayanaka Dantanarayana, Thamirawaran Sathiyalogeswaran, Yichao Yuan, Nishil Talati, Krisztian Flautner, Lingjia Tang. “GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2”. arXiv:2509.16248, 2025-2026. https://arxiv.org/abs/2509.16248
  9. guillesanbri.com. “PyTorch Compilation: From TorchScript to torch.compile”. https://guillesanbri.com/pytorch-compilation/
  10. PyTorch Wikipedia. “PyTorch — TorchScript historical context”. https://en.wikipedia.org/wiki/PyTorch
  11. Soumith Chintala. “PyTorch 2.0 Announcement”. LinkedIn, 2022-12. https://www.linkedin.com/posts/soumith_so-excited-to-introduce-pytorch-20-a-year-activity-7004492936667136000-h4fX
  12. Mark Saroufim. “Frequently Asked Questions — PyTorch torch.compiler FAQ (NumPy support)”. PyTorch Docs, 2025-2026. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_faq.html
  13. Mark Saroufim. “Frequently Asked Questions — PyTorch torch.compiler FAQ”. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_faq.html
  14. PyTorch 2.3 Release. “User-defined Triton kernels in torch.compile”. PyTorch Facebook, 2024-04. https://www.facebook.com/pytorch/posts/426108130046205/
  15. PyTorch 2.4 Release Blog. “AOTInductor freezing, Python 3.12 support”. https://pytorch.org/blog/pytorch2-4/
  16. Sean Kim. “PyTorch 2.5 Release: 7x Faster Compile Cold Start and FlexAttention”. https://blog.imseankim.com/pytorch-2-5-release-compile-mode-improvements-new-features/
  17. PyTorch Blog. “FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention”. https://pytorch.org/blog/flexattention/
  18. vLLM Docs. “Dynamic shapes and vllm guard dropping — backed vs unbacked”. https://docs.vllm.ai/en/latest/design/torch_compile/
  19. ezyang (and PyTorch Compiler Team). “Helion project status”. Referenced in [1] as “beta October 2026”.
  20. CompilerSutra. “Inside torch.compile — The One-Line Picture and Stage Details”. https://www.compilersutra.com/docs/ml-compilers/inside-torch-compile/
  21. depyf Documentation. “A Walk Through Example of torch.compile”. https://depyf.readthedocs.io/en/latest/walk_through.html
  22. PyTorch Documentation. “torch.compile Troubleshooting”. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_troubleshooting.html
  23. gdymind. “jax.jit, torch.compile & CUDA graph — A Comparison”. gdymind’s Blog, 2026-03-07. https://gdymind.com/2026/03/07/jax-jit-torch-compile-CUDA-graph/
  24. PyTorch Developer Mailing List. “Question regarding horizontal fusion”. dev-discuss.pytorch.org, 2025-12. https://dev-discuss.pytorch.org/t/question-regarding-horizontal-fusion/3275
  25. DeepWiki. “Kernel Selection and Autotuning — pytorch/pytorch”. https://deepwiki.com/pytorch/pytorch/2.5.3-kernel-selection-and-autotuning
  26. DeepWiki. “CUDA Graph Capture and Memory Pools — pytorch/pytorch”. https://deepwiki.com/pytorch/pytorch/3.2.2-cuda-graph-capture-and-memory-pools
  27. PyTorch GitHub Issue #121968. “[RFC] Use CUDA graphs by default on torch.compile”. https://github.com/pytorch/pytorch/issues/121968
  28. PyTorch Documentation (Android Git). “torch.compiler_cudagraph_trees.rst”. https://android.googlesource.com/platform/external/pytorch/
  29. PyTorch Blog. “Accelerated PyTorch inference with torch.compile on AWS Graviton”. https://pytorch.org/blog/accelerated-pytorch-inference/
  30. supercharleszhu. “torch-compile-tutorial — Structured trace export (TORCH_TRACE)”. GitHub. https://github.com/supercharleszhu/torch-compile-tutorial
  31. PyTorch Documentation. “AOTInductor: Ahead-Of-Time Compilation for Torch.Export-ed Models”. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_aot_inductor.html
  32. GeneralCompute Blog. “Compiler-Level Optimizations for Inference: TorchInductor, Triton, and XLA”. 2026-05-06. https://www.generalcompute.com/blog/compiler-level-optimizations-for-inference
  33. Tech Insider. “PyTorch vs TensorFlow 2026: 85% Research Share Gap”. 2026-05. https://tech-insider.org/pytorch-vs-tensorflow-2026/
  34. Spheron Network. “PyTorch vs TensorFlow in 2026: Which AI Framework Should You Use?”. 2026-04. https://www.spheron.network/blog/pytorch-vs-tensorflow/
  35. Spheron Network. “torch.compile and CUDA Graphs for LLM Inference on H200 and B200”. https://www.spheron.network/blog/torch-compile-cuda-graphs-llm-inference-pytorch-2-6/