torch.compile in Practice: Before Chasing Speedups, Check Whether the Graph Can Stabilize

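torch.compile opens up new optimization headroom for PyTorch training and inference, but it is not an unconditional speed-up button. Dynamic graphs, frequently changing shapes, Python control flow, and unsupported operators can all make the compilation cost outweigh the benefit. When measuring, do not judge by a single fast iteration; look at warmup time, the number of recompilations, memory changes, and end-to-end throughput.

Before using torch.compile, first confirm that the computation graph can stabilize.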

1. Compilation Gains Come from Graph Optimization

    flowchart TD
        A[PyTorch Eager] --> B[Graph Capture]
        B --> C[Optimization]
        C --> D[Code Generation]
        D --> E[Compiled Execution]
        E --> F{Shape Changed}
        F -->|yes| B
        F -->|no| E

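If input shapes change often, the system may recompile frequently. Each individual execution may then look faster, while overall throughput actually drops.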
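A rough cost model shows why frequent recompilation can erase the gain. The sketch below is illustrative only: the function names and every timing number are made-up assumptions rather than measurements, and it assumes each recompilation costs the same fixed time.

```python
def total_time(steps, step_time, compiles=0, compile_time=0.0):
    """Wall-clock seconds for `steps` iterations plus any compilation cost."""
    return steps * step_time + compiles * compile_time


def end_to_end_speedup(steps, eager_step, compiled_step, compiles, compile_time):
    """Eager time divided by compiled time; values below 1.0 mean a slowdown."""
    eager = total_time(steps, eager_step)
    compiled = total_time(steps, compiled_step, compiles, compile_time)
    return eager / compiled


# Hypothetical numbers: 10 ms eager step, 6 ms compiled step, 30 s per compile.
stable = end_to_end_speedup(10_000, 0.010, 0.006, compiles=1, compile_time=30.0)
unstable = end_to_end_speedup(10_000, 0.010, 0.006, compiles=20, compile_time=30.0)
print(f"stable graph: {stable:.2f}x, unstable graph: {unstable:.2f}x")
```

With these made-up numbers the per-step speedup is about 1.67x in both cases, yet the end-to-end ratio is roughly 1.11x with one compilation and about 0.15x with twenty.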

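2. Benchmarks Must Exclude Warmup

The first run of a compiled model is usually slow, so it must not be averaged together with steady-state execution.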

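    import time

    import torch

    # Assumes `model` and `x` are already defined and live on a CUDA device.
    model = torch.compile(model)

    # Warmup: the first calls trigger compilation and are excluded from timing.
    for _ in range(10):
        _ = model(x)
    torch.cuda.synchronize()  # wait for queued GPU work before starting the clock

    start = time.perf_counter()
    for _ in range(100):
        _ = model(x)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    print("avg latency", (time.perf_counter() - start) / 100)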

For training workloads, also measure backward and the optimizer step, not just forward.
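One way to keep warmup and steady-state numbers apart, and to time a whole training step rather than only forward, is a small harness that accepts any callable. This is a minimal sketch: `benchmark`, `step_fn`, and `sync` are hypothetical names introduced here, and the stand-in workload is plain Python so the block runs without a GPU.

```python
import statistics
import time


def benchmark(step_fn, warmup=10, iters=100, sync=None):
    """Time `step_fn`, reporting warmup and steady-state phases separately.

    `sync` is an optional callable (for example torch.cuda.synchronize) that
    blocks until queued device work has finished.
    """
    def timed_call():
        start = time.perf_counter()
        step_fn()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    warmup_times = [timed_call() for _ in range(warmup)]
    steady_times = [timed_call() for _ in range(iters)]
    return {
        "first_call_s": warmup_times[0] if warmup_times else None,
        "warmup_total_s": sum(warmup_times),
        "steady_mean_s": statistics.mean(steady_times),
        "steady_median_s": statistics.median(steady_times),
    }


if __name__ == "__main__":
    # Stand-in workload; in practice pass a function that runs one full
    # training step (forward, loss.backward(), optimizer.step()).
    result = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=3, iters=20)
    print(result)
```

Synchronizing after every step yields per-step latencies but stops the host from running ahead of the device, so the mean may come out slightly higher than with a single synchronization at the end of the loop.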

3. Record Dynamic Shapes Separately

Variable sequence length is common in NLP tasks. If the batch padding strategy is unstable, compile may keep triggering new graphs.

    compile_eval:
      batch_size: 16
      seq_len_buckets: [128, 256, 512]
      metrics:
        - compile_count
        - tokens_per_second
        - max_memory_allocated

Length bucketing reduces shape variation. Stability at the training-pipeline level is sometimes more important than compiler options.
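Length bucketing can be sketched in a few lines. The example below assumes the three buckets from the config above; `bucket_length` and the sample lengths are hypothetical, and mapping over-long sequences to the largest bucket assumes the caller truncates them.

```python
import bisect
from collections import Counter

BUCKETS = [128, 256, 512]  # taken from the seq_len_buckets config above


def bucket_length(seq_len, buckets=BUCKETS):
    """Return the smallest bucket that fits `seq_len`.

    Sequences longer than the largest bucket map to the largest bucket,
    which assumes the caller truncates them (an assumption of this sketch).
    """
    idx = bisect.bisect_left(buckets, seq_len)
    return buckets[min(idx, len(buckets) - 1)]


# Hypothetical raw sequence lengths from a data loader.
raw_lengths = [37, 120, 128, 129, 200, 311, 500, 512, 640]
padded = [bucket_length(n) for n in raw_lengths]

print("distinct raw shapes:   ", len(set(raw_lengths)))
print("distinct padded shapes:", len(set(padded)))
print("shape distribution:    ", dict(Counter(padded)))
```

Logging the padded-shape distribution alongside throughput makes it easier to tell whether a slowdown comes from the compiler or from the data pipeline.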

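4. Keep a Fallback Path for Failures

torch.compile may hit unsupported operators or produce numerical differences. Before deploying inference or starting long-running training, keep an eager fallback and check output consistency.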

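    import torch

    # Assumes `eager_model`, `compiled_model`, and `x` are already defined.
    with torch.no_grad():
        y_eager = eager_model(x)
        y_compiled = compiled_model(x)

    # Largest absolute element-wise difference between the two outputs.
    diff = (y_eager - y_compiled).abs().max().item()
    print(diff)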

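The difference does not have to be zero, but it must stay within what the task can tolerate. With mixed-precision training in particular, evaluate numerical differences carefully.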
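The eager fallback itself can be a thin wrapper around two callables. The sketch below is one possible shape for it, not a PyTorch API: `WithEagerFallback` is a hypothetical class, and the failing function only simulates a compile error. The try block wraps the call rather than construction because torch.compile compiles lazily, so errors surface on the first invocation.

```python
import logging

logger = logging.getLogger("compile_fallback")


class WithEagerFallback:
    """Call `compiled_fn`; on the first failure, switch to `eager_fn` for good."""

    def __init__(self, compiled_fn, eager_fn):
        self.compiled_fn = compiled_fn
        self.eager_fn = eager_fn
        self.use_compiled = True

    def __call__(self, *args, **kwargs):
        if self.use_compiled:
            try:
                return self.compiled_fn(*args, **kwargs)
            except Exception:
                # Record the failure, then stop using the compiled path.
                logger.exception("compiled path failed; falling back to eager")
                self.use_compiled = False
        return self.eager_fn(*args, **kwargs)


if __name__ == "__main__":
    def broken_compiled(x):
        raise RuntimeError("unsupported operator (simulated)")

    model = WithEagerFallback(broken_compiled, eager_fn=lambda x: x * 2)
    print(model(3))            # served by the eager path
    print(model.use_compiled)  # the wrapper stays on eager afterwards
```

A wrapper like this only catches raised errors. Silent numerical drift still needs the explicit output comparison, since a compiled model that returns slightly different values raises nothing.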

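5. Summary

The gains from torch.compile depend on a stable computation graph, a sensible shape strategy, and supported operators. When evaluating, skip warmup and record the recompilation count, throughput, memory, and numerical consistency.

It is an optimization tool worth trying, but not a one-line fix for every performance problem. Measure first, then decide whether it belongs on the main training or inference path.

If an experiment log lacks recompilation counts and the shape distribution, reporting a speedup ratio alone is incomplete.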
