Deep Learning GPU Benchmarks

GPU training speeds using PyTorch/TensorFlow for computer vision (CV), NLP, text-to-speech (TTS), etc.

PyTorch GPU Benchmarks

TensorFlow GPU Benchmarks

GPU Benchmark Methodology

To measure the relative effectiveness of GPUs for training neural networks, we’ve chosen training throughput as the metric. Training throughput is the number of samples (e.g. tokens or images) processed per second by the GPU.
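
As a rough illustration of that definition, the sketch below times a fixed number of training steps and divides the samples processed by the elapsed time. It is a minimal sketch, not the benchmark code used here: `measure_throughput`, `step_fn`, `warmup_steps`, and `dummy_step` are hypothetical names introduced for this example, and the dummy step sleeps instead of running a model.

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=50, warmup_steps=5,
                       clock=time.perf_counter):
    """Return training throughput in samples per second.

    step_fn      -- hypothetical callable that runs one training step on one batch
    batch_size   -- number of samples (images, tokens, ...) in each batch
    warmup_steps -- untimed steps run first so one-time startup costs are excluded
    clock        -- timer function; injectable so the logic can be checked deterministically
    """
    for _ in range(warmup_steps):
        step_fn()

    start = clock()
    for _ in range(num_steps):
        step_fn()
    elapsed = clock() - start

    # Samples processed during the timed window divided by its duration.
    return batch_size * num_steps / elapsed


if __name__ == "__main__":
    # Stand-in for a real training step: sleeps about 2 ms instead of running a model.
    def dummy_step():
        time.sleep(0.002)

    rate = measure_throughput(dummy_step, batch_size=64, num_steps=25)
    print(f"{rate:.0f} samples/sec")
```

On a real GPU, work is launched asynchronously, so a step function would need to wait for the device to finish (in PyTorch, by calling `torch.cuda.synchronize()`) before the clock is read; otherwise the measured time can understate the actual work.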

Using throughput instead of floating-point operations per second (FLOPS) ties GPU performance directly to the task of training neural networks. Training throughput is strongly correlated with time to solution: with higher training throughput, the GPU can run a dataset through the model more quickly and train it faster.
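
The link between throughput and time to solution can be made concrete with a little arithmetic: training time is roughly the total number of samples processed divided by the throughput. The sketch below assumes throughput stays constant and ignores data loading, evaluation, and checkpointing; the function name and all numbers are hypothetical and are not measured results.

```python
def estimated_training_time(dataset_size, epochs, throughput):
    """Estimate wall-clock training time in seconds.

    dataset_size -- samples per epoch
    epochs       -- number of passes over the dataset
    throughput   -- samples processed per second (assumed constant)
    """
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return dataset_size * epochs / throughput


# Hypothetical workload: 1,000,000 samples, 10 epochs, two made-up GPUs.
slow = estimated_training_time(1_000_000, epochs=10, throughput=1_000)
fast = estimated_training_time(1_000_000, epochs=10, throughput=2_000)
print(f"{slow / 3600:.2f} h vs {fast / 3600:.2f} h")
```

Under these assumptions, doubling throughput halves the estimated training time.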

To maximize training throughput, it’s important to saturate GPU resources with large batch sizes, switch to faster GPUs, or parallelize training across multiple GPUs. It’s also important to test throughput using state-of-the-art (SOTA) model implementations across frameworks, since throughput can be affected by the model implementation.
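
When training is parallelized across multiple GPUs, one common way to judge the result is to compare the measured multi-GPU throughput against ideal linear scaling from a single GPU. The sketch below shows that calculation; `scaling_efficiency` is a hypothetical helper, and the throughput figures are invented for illustration rather than taken from these benchmarks.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Return measured multi-GPU throughput as a fraction of ideal linear scaling."""
    if single_gpu_throughput <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    # Ideal case: every added GPU contributes the full single-GPU throughput.
    ideal = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal


# Hypothetical figures: 1,000 samples/sec on one GPU, 7,200 samples/sec on eight.
efficiency = scaling_efficiency(1_000, 7_200, 8)
print(f"scaling efficiency: {efficiency:.0%}")
```

A value close to 1.0 means the extra GPUs are contributing almost all of their throughput; lower values point to communication or input-pipeline overhead.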

Our TensorFlow benchmarks use Google's official model implementations to test performance. The latest CUDA and cuDNN libraries are installed using Lambda Stack. See our TensorFlow testing repo.

Our PyTorch benchmarks use the NVIDIA Docker deep learning software stack. These PyTorch implementations, updated monthly, are highly optimized for the Tensor Core architecture. See our PyTorch testing repo.