```mermaid
graph LR
A(("🤗 Accelerate#32;"))
A --> B["CLI Interface#32;"]
A --> C["Training Library#32;"]
A --> D["Big Model<br>Inference#32;"]
```
General estimate (bert-base-cased, 108M params):
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---|---|---|---|---|---|
| float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB |
*All estimates were produced with the Model Estimator Tool (`accelerate estimate-memory`)
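To see roughly where these numbers come from, here is a back-of-the-envelope version of the float32 row (a sketch, not the estimator tool itself; it assumes vanilla Adam with two fp32 states per parameter and bert-base-cased's ~108M parameters):

```python
# Rough reconstruction of the float32 row above; the real numbers come from
# `accelerate estimate-memory`, which handles dtypes and peak usage properly.
n_params = 108_310_272              # bert-base-cased (~108M parameters)
MB = 2**20                          # the tool reports binary megabytes

weights = n_params * 4 / MB         # fp32 parameters          ~413 MB
grads   = n_params * 4 / MB         # fp32 gradients           ~413 MB
adam    = 2 * n_params * 4 / MB     # Adam momentum + variance ~826 MB

print(f"backward pass peak : {weights + grads:.2f} MB")                  # ~826 MB
print(f"optimizer step peak: {(weights + grads + adam) / 1024:.2f} GB")  # ~1.61 GB
```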
This works fine for small models: on the GPU-poor side, we have cards with anywhere from 12-24GB of GPU memory.
But what happens as we scale?
Here's llama-3-8B (8.03B parameters):
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---|---|---|---|---|---|
| float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
| float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB |
Well, I don’t have 56GB of GPU memory in a single card, let alone 112GB.
What can we do?
- `sharding_strategy`:
  - `FULL_SHARD`: Includes optimizer states, gradients, and parameters
  - `SHARD_GRAD_OP`: Includes optimizer states and gradients
  - `NO_SHARD`: Normal DDP
  - `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters, but each node has the full model
- `auto_wrap_policy`: `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP`
  - TRANSFORMER / `fsdp_transformers_layer_cls_to_wrap`: transformers has good defaults
  - SIZE / `fsdp_min_num_param`: wrap a layer once it reaches a minimum number of parameters
- `offload_params`: offloads the parameters and gradients to the CPU
  - Case: a full fine-tune (FFT) of Llama-3-8B with `fsdp_offload_params` on 2x4090 GPUs took ~72 hrs, vs. roughly an hour or two when using 1xH100
- `cpu_ram_efficient_loading` and `sync_module_states`: use the meta device to load the model onto the GPU in a low-RAM scenario
  - Rather than needing `model_size * n_gpus` of RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via `sync_module_states`

(See the code sketch below for how these knobs map to accelerate's Python API.)

Libraries built on top of accelerate include axolotl, fastai, FastChat, lucidrains, kornia... Are you using it and you don't even know?
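Coming back to the FSDP knobs above, here is roughly how they map onto accelerate's Python API (a minimal sketch assuming `FullyShardedDataParallelPlugin`; field names can shift between accelerate versions, and the same settings can also live in the YAML config shown later):

```python
# Minimal sketch: configuring FSDP from Python instead of a YAML config.
# Field names follow accelerate's FullyShardedDataParallelPlugin; verify them
# against the version you have installed.
from torch.distributed.fsdp import CPUOffload, ShardingStrategy

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer states
    cpu_offload=CPUOffload(offload_params=False),   # offloading is much slower (see the 2x4090 case)
    sync_module_states=True,                        # broadcast weights from rank 0 to the other shards
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```

The equivalent YAML keys (`fsdp_sharding_strategy`, `fsdp_offload_params`, `fsdp_sync_module_states`, ...) appear in the config file later on.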
```mermaid
graph LR
A(("🤗 Accelerate#32;"))
A --> B["CLI Interface#32;"]
A --> C["Training Library#32;"]
A --> D["Big Model<br>Inference#32;"]
```
- `accelerate config`: interactively configure your training environment
- `accelerate estimate-memory`: estimate how much memory a model needs (the tables above)
- `accelerate launch`: run your training script with that configuration
How can we make this better?
`accelerate launch` reads its settings from `config.yaml` files. Generate one with `accelerate config`, or write your own, e.g. `fsdp_config.yaml`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```

```python
from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer, scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)   # no longer needed: accelerate handles device placement
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)  # loss.backward()
    optimizer.step()
    scheduler.step()
```

What did `prepare` do? Among other things, it sharded the `DataLoader`: rather than iterating through the entire dataset on each of `n` nodes, we instead split it, so the data is consumed `n` GPUs at a time per "global step".

A note on mixed precision: we do not convert the model weights to bf16/fp16. Instead, the forward pass is wrapped with `autocast` to convert the gradients automatically, which preserves the original precision of the weights. If you call `.bf16()` on the model weights, you are STUCK in bf16 permanently.

Going below 16-bit, FP8 training (via NVIDIA TransformerEngine or MS-AMP) keeps different parts of the training state in different precisions:

| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
|---|---|---|---|---|---|---|
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 |
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 |
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 |
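For reference, opting into these modes from accelerate looks roughly like this (a sketch; the `"fp8"` value assumes a recent accelerate, a GPU with FP8 support, and TransformerEngine or MS-AMP installed):

```python
from accelerate import Accelerator

# bf16/fp16 autocast-style mixed precision: the weights themselves stay in fp32
accelerator = Accelerator(mixed_precision="bf16")   # or "fp16"

# FP8 (requires TransformerEngine or MS-AMP plus supported hardware)
# accelerator = Accelerator(mixed_precision="fp8")
```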
FSDP and DeepSpeed also differ in how they handle precision during preparation, depending on the `torch_dtype` you load the model with and the mixed precision setting you pick:

| Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) |
|---|---|---|---|---|---|
| FSDP | bf16 | default (none) | bf16 | bf16 | bf16 |
| FSDP | bf16 | bf16 | fp32 | bf16 | fp32 |
| DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
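To make the second row concrete, here is a hedged sketch (the transformers loading code is my addition, not from the talk; the comments reflect what the table says about accelerate's FSDP integration):

```python
import torch
from transformers import AutoModelForCausalLM

from accelerate import Accelerator

# Load the checkpoint in bf16 (the "Model Loading (torch_dtype)" column).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# Pick bf16 mixed precision; per the table, FSDP then upcasts the sharded
# parameters to fp32 during prepare() so the optimizer steps in full precision.
accelerator = Accelerator(mixed_precision="bf16")
model = accelerator.prepare(model)  # run under `accelerate launch` with the FSDP config above
```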
To learn more, check out the documentation or join my office hours
- Use `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models
- FP8 can help speed up training some and reduce computational overhead