GPU benchmarks on Lambda’s offering of the NVIDIA H100 SXM5 vs the NVIDIA A100 SXM4 using DeepSpeed-Chat’s 3-step training example.
We’ve run an exciting benchmark test on Lambda’s offering of the NVIDIA H100 SXM5 instance, powered by NVIDIA H100 Tensor Core GPUs, using DeepSpeed-Chat’s 3-step training example. We compare its performance with a reference NVIDIA A100 SXM4 Tensor Core system and stress-test its scalability on a whopping 1,024 GPUs across 128 servers.
The head-to-head comparison between Lambda’s NVIDIA H100 SXM5 and NVIDIA A100 SXM4 instances across the 3-step Reinforcement Learning from Human Feedback (RLHF) Pipeline in FP16 shows:
Step 1 (OPT-13B, ZeRO-3): NVIDIA H100 was 2.8x faster.
Step 2 (OPT-350M, ZeRO-0): NVIDIA H100 clinched a 2.5x speed advantage.
Step 3 (OPT-13B ZeRO-3 plus OPT-350M ZeRO-0): NVIDIA H100 raced ahead with a 3.1x speed boost.
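For context, ZeRO-3 partitions model parameters, gradients, and optimizer state across GPUs, while ZeRO-0 leaves DeepSpeed’s partitioning off. Below is a minimal sketch of how an FP16 + ZeRO-3 configuration is handed to DeepSpeed; the stand-in model and per-GPU batch size are illustrative assumptions, not the exact DeepSpeed-Chat settings.

```python
# Minimal sketch: FP16 training with ZeRO Stage 3 in DeepSpeed.
# The stand-in model and batch size are assumptions for illustration,
# not the exact DeepSpeed-Chat Step 1 configuration.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 16,   # assumed per-GPU batch size
    "fp16": {"enabled": True},              # FP16, as used in the benchmark
    "zero_optimization": {"stage": 3},      # ZeRO-3: shard params, grads, optimizer state
}

model = torch.nn.Linear(4096, 4096)         # placeholder for the OPT-13B actor model

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Launched with the `deepspeed` CLI, the resulting engine handles sharding and mixed-precision bookkeeping inside `engine.backward(loss)` and `engine.step()`.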
Testing distributed training scalability:
The large model (OPT-13B) with bigger batches (16 samples/GPU) scaled to a 127.51x throughput on 128 servers.
The smaller model (OPT-350M) with smaller batches (4 samples/GPU) still impressed, scaling to a 112.79x throughput on 128 servers.
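These multipliers translate directly into scaling efficiency, i.e. measured speedup divided by the ideal linear speedup. A quick back-of-the-envelope check:

```python
# Scaling efficiency implied by the reported 128-server throughput numbers.
def scaling_efficiency(measured_speedup: float, n_servers: int) -> float:
    """Ratio of measured speedup to ideal linear scaling."""
    return measured_speedup / n_servers

print(f"OPT-13B,  16 samples/GPU: {scaling_efficiency(127.51, 128):.1%}")  # ~99.6%
print(f"OPT-350M, 4 samples/GPU:  {scaling_efficiency(112.79, 128):.1%}")  # ~88.1%
```

The larger model with bigger per-GPU batches does more compute per communication step, which is why it sits closer to perfectly linear scaling.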
DeepSpeed training on the NVIDIA H100 SXM5 system shows a 2.5x-3.1x speedup over the NVIDIA A100 SXM4 system. Lambda Cloud Clusters feature 80GB NVIDIA H100 SXM5 GPUs, an InfiniBand interconnect with a 1:1 NIC-to-GPU ratio, and rail-optimized networking, delivering unprecedented performance and scalability on thousands of GPUs.
Learn more and reach out to the Lambda team to reserve your allocation.