Scaling Model Training with More Compute, How Do They Do It?

Who am I?

Zachary Mueller
Technical Lead for the 🤗 Accelerate project
API design geek

Understanding GPU Usage

We can somewhat estimate the memory usage in vanilla full-fine-tuning of models
Requires certain assumptions (that I’ll be covering):
- Adam optimizer
- Batch size of 1

Understanding GPU Usage

General estimate (bert-base-cased, 108M params):

Each parameter is 4 bytes
Backward ~= 2x the model size
The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):

dtype	Model	Gradients	Backward pass	Optimizer step	Highest
float32	413.18 MB	413.18 MB	826.36 MB	1.61 GB	1.61 GB
float16	413.18 MB*	619.77 MB	826.36 MB	826.36 MB	826.36 MB

*All estimations were based off the Model Estimator Tool

Understanding GPU Usage

This works fine for small models, we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).

But what happens as we scale?

Here’s llama-3-8B (8.03B parameters)

dtype	Model	Gradients	Backward pass	Optimizer step	Highest
float32	28.21 GB	28.21 GB	56.43 GB	112.84 GB	112.84 GB
float16	28.21 GB*	42.32 GB	56.43 GB	56.43 GB	56.43 GB

Well, I don’t have 56GB of GPU memory in a single card, let alone 112GB.

What can we do?

Distributed Training

Kinds of Training

Single GPU:
- No distributed techniques at play
Distributed Data Parallelism (DDP):
- A full copy of the model exists on each device, but data is chunked between each GPU
Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS):
- Split chunks of the model and optimizer states across GPUs, allowing for training bigger models on smaller (multiple) GPUs

Fully Sharded Data Parallelism

FSDP: Getting parameter specific

Different parameters can dicatate how much memory is needed for total GPU training across multiple GPUs
These include how model weights are sharded, gradients, and more.
I’ll cover some important ones I needed when doing a Full-Fine-Tune of Llama-3-8B without PEFT on 2x4090’s

`sharding_strategy`

Dictates the level of divving resources to perform
- FULL_SHARD: Includes optimizer states, gradients, and parameters
- SHARD_GRAD_OP: Includes optimizer states and gradients
- NO_SHARD: Normal DDP
- HYBRID_SHARD: Includes optimizer states, gradients, and parameters but each node has the full model
```
FULL_SHARD:
  Parameters, Gradients, Optimizer States: All are sharded.
  Parameters Handling: Unshard before forward pass, reshard after forward pass, unshard before backward pass, reshard after backward pass.
  Gradients Handling: Synchronize and shard after backward pass.
  Optimizer States: Updated locally per rank.
```
SHARD_GRAD_OP: Gradients and Optimizer States: Sharded during computation. Parameters: Unshard before forward pass, remain unsharded during forward pass, reshard after backward pass. Inside no_sync(): Parameters are not resharded after backward computation. Optimizer States: Updated locally per rank.

NO_SHARD: Parameters, Gradients, Optimizer States: Not sharded, replicated across ranks. Gradients Handling: Synchronized via all-reduce after backward pass. Optimizer States: Updated locally per rank.

HYBRID_SHARD: Parameters, Gradients, Optimizer States: Combines FULL_SHARD within a node and replicates parameters across nodes. Communication: Expensive operations like all-gathers and reduce-scatters are limited to within a node, enhancing performance for medium-sized models.

`auto_wrap_policy`:

How the model should be split
Can be either TRANSFORMER_BASED_WRAP or SIZE_BASED_WRAP
TRANSFORMER/fsdp_transformers_layer_cls_to_wrap:
- Need to declare the layer
- Generally transformers has good defaults
SIZE/fsdp_min_num_param:
- Number of total parameters in a shard

`offload_params`:

Offloads the parameters and gradients to the CPU if they can’t fit into memory
Allows you to train much larger models locally, but will be much slower

Case: FFT of Llama-3-8B with fsdp_offload_params on 2x4090 GPUs was 72hrs, vs ~an hour or two when using 1xH100

`cpu_ram_efficient_loading` and `sync_module_states`

Uses the idea behind big model inference/the meta device to load in the model to the GPU in a low-ram scenario
Rather than needing model_size * n_gpus RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via sync_module_states

Tying this to 🤗 Accelerate

So far we’ve covered the theory, but how do we put it into practice
By using a library that’s at the heart of the entire open-source ecosystem

Nearly all of 🤗
axolotl
fastai
FastChat
lucidrains
kornia

Are you using it and you don’t even know?

What is 🤗 Accelerate?

graph LR
    A(("🤗 Accelerate#32;"))
    A --> B["CLI Interface#32;"]
    A --> C["Training Library#32;"]
    A --> D["Big Model<br>Inference#32;"]

A CLI Interface

accelerate config
- Configure the environment
accelerate estimate-memory
- How to guess vRAM requirements
accelerate launch
- How to run your script

Launching distributed training is hard

```
python script.py
```

torchrun --nnodes=1 --nproc_per_node=2 script.py

```
deepspeed --num_gpus=2 script.py
```

How can we make this better?

`accelerate launch`

accelerate launch script.py

`accelerate config`

Rely on config.yaml files
Choose to either running accelerate config or write your own:

ddp_config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8

fsdp_config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8

A Training Library

A Training Library: The Code

# For alignment purposes
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()

from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss) # loss.backward()
    optimizer.step()
    scheduler.step()

A Training Library: How Scaling Works

Accelerate’s DataLoaders and schedulers work off of a sharding mindset
Rather than repeating the same data across n nodes, we instead split it
Speeds up training linearly
Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2
This also means the scheduler will be stepped n GPUs at a time per “global step”

A Training Library: Mixed Precision

This may be a bit different than your “normal” idea of mixed precision.
We do not convert the model weights to BF16/FP16
Instead we wrap the forward pass with autocast to convert the gradients automatically
This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on.
If you use .bf16() weights, you are STUCK in bf16 perminantly

A Training Library: Mixed Precision

Let’s tie that back up to the model estimator with neat tools like NVIDIA’s TransformerEngine

Optimization Level	Computation (GEMM)	Comm	Weight	Master Weight	Weight Gradient	Optimizer States
FP16 AMP	FP16	FP32	FP32	N/A	FP32	FP32+FP32
Nvidia TE	FP8	FP32	FP32	N/A	FP32	FP32+FP32
MS-AMP O1	FP8	FP8	FP16	N/A	FP8	FP32+FP32
MS-AMP O2	FP8	FP8	FP16	N/A	FP8	FP8+FP16
MS-AMP O3	FP8	FP8	FP8	FP16	FP8	FP8+FP16

DeepSpeed vs Fully Sharded Data Parallelism

Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation

Framework	Model Loading (`torch_dtype`)	Mixed Precision	Preparation (Local)	Training	Optimizer (Local)
FSDP	bf16	default (none)	bf16	bf16	bf16
FSDP	bf16	bf16	fp32	bf16	fp32
DeepSpeed	bf16	bf16	fp32	bf16	fp32

To learn more, check out the documentation or join my office hours

Key Takeaways:

You can scale out training with accelerate, FSDP, and DeepSpeed across multiple GPUs to train bigger models
Techniques like FP8 can help speed up training some and reduce computational overhead
Comes at a cost of end-precision and locking model weights for futher fine-tunes if not careful

Scaling Model Training with More Compute, How Do They Do It?

Who am I?

Understanding GPU Usage

Understanding GPU Usage

Understanding GPU Usage

Distributed Training

Kinds of Training

Fully Sharded Data Parallelism

Fully Sharded Data Parallelism

FSDP: Getting parameter specific

sharding_strategy

auto_wrap_policy:

offload_params:

cpu_ram_efficient_loading and sync_module_states

Tying this to 🤗 Accelerate

Tying this to 🤗 Accelerate

What is 🤗 Accelerate?

A CLI Interface

Launching distributed training is hard

accelerate launch

accelerate config

A Training Library

A Training Library: The Code

A Training Library: How Scaling Works

A Training Library: Mixed Precision

A Training Library: Mixed Precision

DeepSpeed vs Fully Sharded Data Parallelism

Key Takeaways:

Some Handy Resources

`sharding_strategy`

`auto_wrap_policy`:

`offload_params`:

`cpu_ram_efficient_loading` and `sync_module_states`

`accelerate launch`

`accelerate config`