Titan V Deep Learning Benchmarks with TensorFlow

March 12, 2019

In this post, Lambda Labs benchmarks the Titan V's Deep Learning / Machine Learning performance and compares it to other commonly used GPUs. We use the Titan V to train ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and SSD300. We measure the # of images processed per second while training each network.

A few notes:

We use TensorFlow 1.12 / CUDA 10.0.130 / cuDNN 7.4.1
Single-GPU benchmarks were run on Lambda's deep learning workstation
Multi-GPU benchmarks were run on Lambda's PCIe GPU Server
V100 benchmarks were run on Lambda's SXM3 Tesla V100 Server
Tensor Cores were utilized on all GPUs that have them

Titan V - FP32 TensorFlow Performance (1 GPU)

For FP32 training of neural networks, the NVIDIA Titan V is...

42% faster than RTX 2080
41% faster than GTX 1080 Ti
26% faster than Titan Xp
4% faster than RTX 2080 Ti
90% as fast as Titan RTX
75% as fast as Tesla V100 (32 GB)

as measured by the # of images processed per second during training.
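The percentages above can be checked against the raw FP32 table at the end of this post. The sketch below is one plausible way to do that, assuming each figure is the mean of the per-model throughput ratios. The post does not state the exact averaging method, and `relative_speed` is an illustrative helper name, not part of the benchmark repository.

```python
# FP32 images/sec (1 GPU), copied from the raw benchmark table in this post.
# Order: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG16, AlexNet, SSD300
TITAN_V = [300, 107, 208, 77, 195, 3796, 156]
RTX_2080 = [213, 83, 142, 56, 122, 2650, 111]
RTX_2080_TI = [294, 110, 194, 79, 170, 3627, 149]


def relative_speed(gpu, baseline):
    """Mean of per-model throughput ratios gpu / baseline.

    Assumption: the headline figures average per-model ratios; the post
    does not spell out the aggregation, so this is only a sanity check.
    """
    ratios = [g / b for g, b in zip(gpu, baseline)]
    return sum(ratios) / len(ratios)


# Convert "x times as fast" into "percent faster", rounded to whole percents.
pct_vs_2080 = round((relative_speed(TITAN_V, RTX_2080) - 1) * 100)
pct_vs_2080_ti = round((relative_speed(TITAN_V, RTX_2080_TI) - 1) * 100)

print(f"Titan V vs RTX 2080:    {pct_vs_2080}% faster")
print(f"Titan V vs RTX 2080 Ti: {pct_vs_2080_ti}% faster")
```

Under that assumption, the two comparisons shown here agree with the 42% and 4% figures in the list. Other rows may differ by a point because the table values are themselves rounded.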

Titan V - FP16 TensorFlow Performance (1 GPU)

For FP16 training of neural networks, the NVIDIA Titan V is...

111% faster than GTX 1080 Ti
94% faster than Titan Xp
70% faster than RTX 2080
23% faster than RTX 2080 Ti
87% as fast as Titan RTX
68% as fast as Tesla V100 (32 GB)

as measured by the # of images processed per second during training.

FP32 Multi-GPU Scaling Performance (1, 2, 4, 8 GPUs)

For each GPU type (Titan V, RTX 2080 Ti, RTX 2080, etc.) we measured performance while training with 1, 2, 4, and 8 GPUs on each neural network and then averaged the results. The chart below provides guidance as to how each GPU scales during multi-GPU training of neural networks in FP32. The chart can be read as follows:

Using eight Titan Vs will be 5.18x faster than using a single Titan V
Using eight Tesla V100s will be 9.68x faster than using a single Titan V
Using eight Tesla V100s is 9.68 / 5.18 = 1.87x faster than using eight Titan Vs
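The arithmetic behind that reading is small enough to write out. The sketch below uses only the two scaling factors quoted above, both expressed relative to a single Titan V. The efficiency figure is simply the quoted 5.18x divided by the GPU count, and the variable names are illustrative.

```python
# FP32 scaling factors quoted above, both relative to ONE Titan V.
EIGHT_TITAN_V = 5.18
EIGHT_V100 = 9.68
NUM_GPUS = 8

# Because both numbers share the same baseline, their ratio compares the
# two eight-GPU configurations directly.
v100_vs_titan_v = EIGHT_V100 / EIGHT_TITAN_V

# Fraction of ideal (linear) scaling that eight Titan Vs achieve.
titan_v_efficiency = EIGHT_TITAN_V / NUM_GPUS

print(f"8x V100 vs 8x Titan V: {v100_vs_titan_v:.2f}x")
print(f"8x Titan V scaling efficiency: {titan_v_efficiency * 100:.0f}% of linear")
```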

Titan V - FP16 vs. FP32

FP16 can reduce training times and enable larger batch sizes/models without significantly impacting model accuracy. Compared with FP32, FP16 training on the Titan V is...

80% faster on ResNet-50
69% faster on ResNet-152
70% faster on Inception v3
51% faster on Inception v4
96% faster on VGG-16
78% faster on AlexNet
57% faster on SSD300

as measured by the # of images processed per second during training. This gives an average speed-up of +71.6%.
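These speed-ups follow from the Titan V columns of the raw FP32 and FP16 tables at the end of this post. A minimal sketch of the calculation is below. It assumes the reported average is the mean of the rounded per-model percentages, which reproduces 71.6%. `speedup_percent` is an illustrative helper name.

```python
# Titan V images/sec (1 GPU), copied from the raw benchmark tables in this post.
FP32 = {
    "ResNet-50": 300, "ResNet-152": 107, "Inception v3": 208,
    "Inception v4": 77, "VGG-16": 195, "AlexNet": 3796, "SSD300": 156,
}
FP16 = {
    "ResNet-50": 539, "ResNet-152": 181, "Inception v3": 353,
    "Inception v4": 116, "VGG-16": 383, "AlexNet": 6746, "SSD300": 245,
}


def speedup_percent(fp16, fp32):
    """Percent by which FP16 throughput exceeds FP32, rounded to a whole number."""
    return round((fp16 / fp32 - 1) * 100)


speedups = {model: speedup_percent(FP16[model], FP32[model]) for model in FP32}

# Assumption: the headline average is taken over the rounded percentages.
average = round(sum(speedups.values()) / len(speedups), 1)

for model, pct in speedups.items():
    print(f"{pct}% faster on {model}")
print(f"Average speed-up: +{average}%")
```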

Caveat emptor: If you're new to machine learning or simply testing code, we recommend using FP32. Lowering precision to FP16 may interfere with convergence.

GPU Prices

Titan V: $2,999.00
RTX 2080 Ti: $1,199.00
RTX 2080: $799.00
Titan RTX: $2,499.00
Tesla V100 (32 GB): ~$8,200.00
GTX 1080 Ti: $699.00
Titan Xp: $1,200.00

Methods

For each model we ran 10 training experiments and measured # of images processed per second; we then averaged the results of the 10 experiments.
For each GPU / neural network combination, we used the largest batch size that fit into memory. For example, on ResNet-50, the V100 used a batch size of 192; the RTX 2080 Ti used a batch size of 64.
We used synthetic data, as opposed to real data, to minimize non-GPU-related bottlenecks.
Multi-GPU training was performed using model-level parallelism.
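The averaging step in the first bullet can be sketched as follows. This is only an illustration: `average_throughput` is a hypothetical helper, not a function from the benchmark repository, and the sample values are placeholders rather than measured results.

```python
from statistics import mean


def average_throughput(images_per_sec_runs):
    """Average images/sec over repeated training experiments."""
    if not images_per_sec_runs:
        raise ValueError("need at least one run")
    return mean(images_per_sec_runs)


# Placeholder values for illustration only; these are NOT benchmark results.
example_runs = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0]

reported = average_throughput(example_runs)
print(f"{reported:.1f} images/sec averaged over {len(example_runs)} runs")
```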

Hardware

Single-GPU training: Lambda Quad - Deep Learning GPU Workstation. CPU: i9-7920X / RAM: 64 GB DDR4 2400 MHz.
Multi-GPU training: Lambda Blade - Deep Learning GPU Server. CPU: Xeon E5-2650 v4 / RAM: 128 GB DDR4 2400 MHz ECC
V100 Benchmarks: Lambda Hyperplane - Tesla V100 Server. CPU: Xeon Gold 6148 / RAM: 256 GB DDR4 2400 MHz ECC

Software

Ubuntu 18.04 (Bionic)
TensorFlow 1.12
CUDA 10.0.130
cuDNN 7.4.1

Run Our Benchmarks On Your Own Machine

Our benchmarking code is on GitHub. We'd love it if you shared the results with us by emailing s@lambdalabs.com or tweeting @LambdaAPI.

Step #1: Clone Benchmark Repository

git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive

Step #2: Run Benchmark

Replace gpu_index with the index of the GPU to benchmark (default 0) and num_iterations with the number of iterations to run (default 10).

cd lambda-tensorflow-benchmark
./benchmark.sh gpu_index num_iterations

Step #3: Report Results

Check the repo directory for the folder <cpu>-<gpu>.logs (generated by benchmark.sh).
Use the same num_iterations in benchmarking and reporting.

./report.sh <cpu>-<gpu>.logs num_iterations

Raw Benchmark Data

FP32: # Images Processed Per Sec During TensorFlow Training (1 GPU)

Model / GPU	RTX 2080 Ti	RTX 2080	Titan RTX	Titan V	V100	Titan Xp	1080 Ti
ResNet-50	294	213	330	300	405	236	209
ResNet-152	110	83	129	107	155	90	81
Inception v3	194	142	221	208	259	151	136
Inception v4	79	56	96	77	112	63	58
VGG16	170	122	195	195	240	154	134
AlexNet	3627	2650	4046	3796	4782	3004	2762
SSD300	149	111	169	156	200	123	108

FP16: # Images Processed Per Sec During TensorFlow Training (1 GPU)

Model / GPU	RTX 2080 Ti	RTX 2080	Titan RTX	Titan V	V100	Titan Xp	1080 Ti
ResNet-50	466	329	612	539	811	289	263
ResNet-152	167	124	234	181	305	104	96
Inception v3	286	203	381	353	494	169	156
Inception v4	106	74	154	116	193	67	62
VGG16	255	178	383	383	511	166	149
AlexNet	4988	3458	6627	6746	8922	3104	2891
SSD300	195	153	292	245	350	136	123