This presentation is a high-level overview of the training regimes you'll encounter as you move from single-GPU to multi-GPU to multi-node distributed training. For each regime, it briefly describes where the computation happens, how the gradients are communicated, and how the model is updated and communicated, and provides a visualization of the system. You can download the full PDF presentation here.
Single-GPU training
Computation happens: On one GPU
Gradient transfers: N/A
Model transfers: N/A
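For concreteness, here is a minimal single-GPU training sketch, assuming PyTorch with a hypothetical toy model and synthetic data. Every tensor, gradient, and parameter lives on the same device, so no transfers are needed.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0")
model = nn.Linear(128, 10).to(device)           # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(32, 128, device=device)     # synthetic batch
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()              # gradients stay on the same GPU
    optimizer.step()                             # model is updated in place on the GPU
```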
Multi-GPU training with gradient reduction on the CPU
Computation happens: On all GPUs and CPU
Gradient transfers: From GPU to CPU (reduce)
Model transfers: From CPU to GPU (broadcast)
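A minimal sketch of this pattern, assuming PyTorch, at least two CUDA GPUs, and a hypothetical toy model: each GPU replica computes gradients, the gradients are reduced (averaged) on the CPU, the CPU master copy is updated, and the new parameters are broadcast back to the GPUs.

```python
import copy
import torch
import torch.nn as nn

devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
master = nn.Linear(128, 10)                      # master copy of the model, on the CPU
optimizer = torch.optim.SGD(master.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
replicas = [copy.deepcopy(master).to(d) for d in devices]

for step in range(10):
    # Each GPU replica runs forward/backward on its own shard of the batch.
    for replica, d in zip(replicas, devices):
        x = torch.randn(32, 128, device=d)       # synthetic data shard
        y = torch.randint(0, 10, (32,), device=d)
        replica.zero_grad()
        loss_fn(replica(x), y).backward()

    # Reduce: average the per-replica gradients on the CPU.
    for p_master, *p_replicas in zip(master.parameters(),
                                     *[r.parameters() for r in replicas]):
        p_master.grad = torch.stack([p.grad.cpu() for p in p_replicas]).mean(dim=0)
    optimizer.step()                             # update the CPU master copy

    # Broadcast: copy the updated parameters back to every GPU replica.
    for replica in replicas:
        replica.load_state_dict(master.state_dict())
```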
Multi-GPU training with NCCL all-reduce
Computation happens: On all GPUs
Gradient transfers: GPU to GPU during NCCL all-reduce
Model transfers: GPU to GPU during NCCL all-reduce
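A sketch of the same idea with one process per GPU and an explicit NCCL all-reduce of the gradients, assuming PyTorch and a hypothetical toy model, launched with something like `torchrun --nproc_per_node=<num_gpus> train.py`.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    torch.manual_seed(0)                              # every rank starts from identical weights
    model = nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    torch.manual_seed(dist.get_rank())                # but sees different synthetic data

    for step in range(100):
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()

        # Gradients move GPU to GPU inside the NCCL all-reduce; every rank then
        # holds the same averaged gradient and applies an identical update.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```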
Multi-node training with asynchronous parameter servers
Computation happens: On all workers and parameter servers
Gradient transfers: Worker to parameter server (asynchronously)
Model transfers: Parameter server to worker (asynchronously)
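To make the asynchrony concrete, here is a toy, CPU-only simulation rather than a real distributed setup: threads stand in for the workers, a shared tensor stands in for the parameter server, and every pushed gradient is applied immediately, without waiting for the other workers, so updates may be computed from stale parameters. All names are hypothetical.

```python
import threading
import torch

params = torch.zeros(10)             # the "model", held by the parameter server
target = torch.randn(10)             # ground truth the workers try to fit
lr = 0.1
lock = threading.Lock()

def worker(num_steps: int) -> None:
    for _ in range(num_steps):
        with lock:
            local = params.clone()   # pull the current model (possibly stale)
        grad = 2 * (local - target)  # gradient of ||local - target||^2
        with lock:
            params.sub_(lr * grad)   # push: applied immediately, no waiting

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to target:", torch.norm(params - target).item())
```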
Multi-node training with synchronous parameter servers
Computation happens: On all workers and parameter servers
Gradient transfers: Worker to parameter server
Model transfers: Parameter server to worker
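The synchronous variant differs in that the server waits for a gradient from every worker before applying a single averaged update. A toy, CPU-only simulation of that round structure, with threads again standing in for workers and all names hypothetical:

```python
import threading
import torch

num_workers = 4
params = torch.zeros(10)                  # the "model", held by the parameter server
target = torch.randn(10)                  # ground truth the workers try to fit
lr = 0.1
grads = [None] * num_workers

def apply_update():
    # Runs exactly once per round, after every worker has contributed a gradient.
    params.sub_(lr * torch.stack(grads).mean(dim=0))

barrier = threading.Barrier(num_workers, action=apply_update)

def worker(rank: int, num_rounds: int) -> None:
    for _ in range(num_rounds):
        local = params.clone()            # every worker sees the same model version
        grads[rank] = 2 * (local - target)    # gradient of ||local - target||^2
        barrier.wait()                    # wait for all workers, then the update runs

threads = [threading.Thread(target=worker, args=(r, 100)) for r in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to target:", torch.norm(params - target).item())
```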
Multi-node training with sharded parameter servers
Computation happens: On all workers and parameter servers
Gradient transfers: Worker gradient shards to parameter servers
Model transfers: Parameter server model shards to workers
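Here the model is split into shards, each owned by a different server, so a worker pushes gradient shards to (and pulls model shards from) several servers in parallel. A toy, CPU-only sketch of the sharding itself, with all names hypothetical:

```python
import torch

num_shards = 4
target = torch.randn(100)                 # ground truth the worker tries to fit
# Each "server" owns one contiguous shard of the flattened parameter vector.
server_shards = list(torch.zeros(100).chunk(num_shards))
lr = 0.1

def worker_step() -> None:
    # Pull: reassemble the full model from every server's shard.
    local = torch.cat(server_shards)
    full_grad = 2 * (local - target)      # gradient of ||local - target||^2
    # Push: send each server only the shard of the gradient it owns.
    for shard, grad_shard in zip(server_shards, full_grad.chunk(num_shards)):
        shard.sub_(lr * grad_shard)

for _ in range(100):
    worker_step()
print("distance to target:", torch.norm(torch.cat(server_shards) - target).item())
```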
Multi-node training with all-reduce
Computation happens: On all workers
Gradient transfers: Worker transfers gradient to peers during all-reduce
Model transfers: Model “update” happens at the end of the multi-node all-reduce operation
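In practice this pattern is what libraries such as Horovod or PyTorch DistributedDataParallel implement. A minimal PyTorch DDP sketch with a hypothetical toy model and synthetic data, launched with one process per GPU on every node, e.g. `torchrun --nnodes=<N> --nproc_per_node=<gpus_per_node> --rdzv_backend=c10d --rdzv_endpoint=<host>:<port> train.py`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # NCCL spans nodes, using IB/RDMA if available
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(nn.Linear(128, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 128, device=device)   # synthetic data
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()           # gradients are all-reduced across all
                                                  # workers during backward()
        optimizer.step()                          # identical update on every rank; the model
                                                  # "update" ends the all-reduce round

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```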
To achieve full performance when scaling from a single GPU to a multi-node distributed training cluster, you'll need to take HPC-style hardware into consideration: NVLink, InfiniBand networking, and GPUs that support features like GPU Direct RDMA.
Traditionally, transferring memory from a GPU on node 1 to a GPU on node N involves a copy from GPU memory to CPU memory on node 1, a copy from CPU memory to the NIC on node 1, and then out the door. With GPU Direct RDMA (Remote Direct Memory Access), a feature only available on the Tesla line of cards, you can skip this "double copy" and send data directly from GPU memory out the door through the InfiniBand card.
The two data pathways for GPU memory transfer (with and without RDMA) are visualized below:
The Lambda Hyperplane V100 server is designed to enable GPU Direct RDMA and node-to-node communication bandwidth of ~42 GB/s, against a theoretical peak of 50 GB/s (4 adapters per node × 100 Gb/s per adapter ÷ 8 bits per byte = 50 GB/s).
The board design above, combined with the GPU Direct RDMA capabilities of the InfiniBand adapters and the NVIDIA V100 GPUs, allows the measured bandwidth between nodes to reach 42 GB/s, or 84% of the theoretical peak.
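As a quick sanity check of those numbers (all values taken from the text above):

```python
adapters_per_node = 4
gbit_per_adapter = 100                    # Gb/s per InfiniBand adapter
theoretical_GBps = adapters_per_node * gbit_per_adapter / 8   # bits -> bytes
measured_GBps = 42

print(f"theoretical peak: {theoretical_GBps:.0f} GB/s")        # 50 GB/s
print(f"efficiency: {measured_GBps / theoretical_GBps:.0%}")   # 84%
```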
This type of distributed training is useful for large-scale image, language, and speech models such as NASNet, BERT, and GPT-2.
Additional thanks to Chuan Li and Steve Clarkson.