| Field | Value |
|---|---|
| Source | Optional; install with: hermes skills install official/mlops/torchtitan |
| Path | optional-skills/mlops/torchtitan |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | torch>=2.6.0, torchtitan>=0.2.0, torchao>=0.5.0 |
| Platforms | linux, macos |
| Tags | Model Architecture, Distributed Training, TorchTitan, FSDP2, Tensor Parallel, Pipeline Parallel, Context Parallel, Float8, Llama, Pretraining |
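
The Dependencies row lists minimum versions, so it can help to confirm the active environment satisfies them before launching a job. The following is a minimal sketch, not part of the skill itself: the helper names (`parse_version`, `meets_minimum`, `check_dependencies`) are hypothetical, and the comparison looks only at the numeric release segments, so pre-release and local build tags are ignored.

```python
from importlib import metadata

# Minimum versions copied from the Dependencies row above.
MINIMUMS = {"torch": "2.6.0", "torchtitan": "0.2.0", "torchao": "0.5.0"}


def parse_version(text):
    """Turn '2.6.0+cu124' or '0.2.0.dev1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare release numbers only, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUMS):
    """Return {package: status string} for every listed package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_dependencies().items():
        print(f"{package}: {status}")
```

Running the script prints one line per package; anything reported as "not installed" or "needs >= ..." should be resolved before moving on to the workflows below.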

Single-node pretraining:
- [ ] Step 1: Download the tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"
[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"
[optimizer]
name = "AdamW"
lr = 3e-4
[lr_scheduler]
warmup_steps = 200
[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"
[parallelism]
data_parallel_shard_degree = -1 # Use all GPUs for FSDP
[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"
[checkpoint]
enable = true
folder = "checkpoint"
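interval = 500

TensorBoard logs, when enabled, are written under ./outputs/tb/.

Multi-node training:
- [ ] Step 1: Configure parallelism for the target scale
- [ ] Step 2: Set up the SLURM script
- [ ] Step 3: Submit the job
- [ ] Step 4: Resume from a checkpoint

[parallelism]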
data_parallel_shard_degree = 32 # FSDP across 32 ranks
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 1 # No PP for 70B
context_parallel_degree = 1 # Increase for long sequences

Float8 training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile enabled

[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"] # Exclude output layer
[compile]
enable = true
components = ["model", "loss"]

4D parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create a seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

[parallelism]
data_parallel_shard_degree = 8 # FSDP
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 8 # PP across nodes
context_parallel_degree = 1 # CP for long sequences
[training]
local_batch_size = 32
seq_len = 8192

[activation_checkpoint]
mode = "full" # Instead of "selective"
[training]
local_batch_size = 1

[training]
local_batch_size = 1
global_batch_size = 32 # Accumulates gradients

[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]

Supported models:

| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production-ready |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion model | Experimental |
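
The [parallelism] degrees in the workflows earlier in this document have to multiply out to the number of GPUs in the job (32 x 8 = 256 for the multi-node example, 8 x 8 x 8 x 1 = 512 for the 4D example). A minimal Python sketch of that arithmetic follows. It rests on assumptions this document does not state: that the world size is the product of the four degrees listed here (any other parallelism dimensions are ignored), that -1 means "use all remaining ranks", and that gradient accumulation steps equal global_batch_size divided by local_batch_size times the data-parallel degree. The function names are hypothetical.

```python
def resolve_dp_shard(world_size, dp_shard=-1, tp=1, pp=1, cp=1):
    """Return the FSDP shard degree, filling in -1 from the remaining ranks."""
    other = tp * pp * cp
    if dp_shard == -1:
        if world_size % other != 0:
            raise ValueError(
                f"world size {world_size} is not divisible by tp*pp*cp = {other}"
            )
        return world_size // other
    if dp_shard * other != world_size:
        raise ValueError(
            f"degrees multiply to {dp_shard * other}, expected {world_size}"
        )
    return dp_shard


def grad_accum_steps(global_batch_size, local_batch_size, dp_degree):
    """Micro-batches accumulated per optimizer step (assumed formula)."""
    per_step = local_batch_size * dp_degree
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not a multiple of {per_step}"
        )
    return global_batch_size // per_step


if __name__ == "__main__":
    # Multi-node example: FSDP 32 x TP 8 on 256 GPUs.
    print(resolve_dp_shard(256, dp_shard=32, tp=8))
    # 4D example: FSDP 8 x TP 8 x PP 8 x CP 1 on 512 GPUs.
    print(resolve_dp_shard(512, dp_shard=8, tp=8, pp=8, cp=1))
    # dp_shard = -1 on a single node; 8 GPUs is an assumed node size.
    print(resolve_dp_shard(8))
    # local_batch_size = 1, global_batch_size = 32, 8 data-parallel ranks (assumed).
    print(grad_accum_steps(32, 1, 8))
```

Checking the product before submitting a job catches mismatched degrees early, which is cheaper than discovering them after the job has been queued and scheduled.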

Throughput benchmarks (TPS/GPU = tokens per second per GPU):

| Model | GPUs | Parallelism | TPS/GPU | Notes |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP + compile + FP8 | 8,532 | +48% vs. baseline |
| Llama 70B | 256 | FSDP + TP + AsyncTP | 876 | 2D parallelism |
| Llama 405B | 512 | FSDP + TP + PP | 128 | 3D parallelism |
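
The per-GPU figures above can be turned into cluster-level throughput and a rough training-time estimate. The sketch below reproduces the +48% figure from the table and multiplies TPS/GPU by the GPU count. The function names and the one-trillion-token budget are hypothetical, chosen only for illustration, and the estimate assumes throughput stays constant for the whole run.

```python
# (label, GPU count, tokens/s per GPU) copied from the benchmark table.
BENCHMARKS = [
    ("Llama 8B, FSDP", 8, 5762),
    ("Llama 8B, FSDP+compile+FP8", 8, 8532),
    ("Llama 70B, FSDP+TP+AsyncTP", 256, 876),
    ("Llama 405B, FSDP+TP+PP", 512, 128),
]

SECONDS_PER_DAY = 86_400


def cluster_tps(gpus, tps_per_gpu):
    """Aggregate tokens per second across the whole job."""
    return gpus * tps_per_gpu


def speedup_percent(baseline_tps, improved_tps):
    """Relative throughput gain over the baseline, in percent."""
    return (improved_tps / baseline_tps - 1.0) * 100.0


def days_to_train(total_tokens, gpus, tps_per_gpu):
    """Wall-clock days for a token budget at constant throughput."""
    return total_tokens / cluster_tps(gpus, tps_per_gpu) / SECONDS_PER_DAY


if __name__ == "__main__":
    gain = speedup_percent(5762, 8532)
    print(f"FP8 + compile gain over FSDP baseline: {gain:.0f}%")

    token_budget = 1_000_000_000_000  # hypothetical 1T-token run
    for label, gpus, tps in BENCHMARKS:
        total = cluster_tps(gpus, tps)
        days = days_to_train(token_budget, gpus, tps)
        print(f"{label}: {total:,} tokens/s, about {days:,.0f} days for 1T tokens")
```

Per-GPU throughput falls as models grow, so the aggregate figure is the more useful number when comparing rows with different GPU counts.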