NVIDIA A100 GPU & DGX A100 Server - Deep Learning Benchmark Estimates

Lambda customers are starting to ask about the new Tesla A100 GPU and our Hyperplane A100 server. To answer the big question upfront:

For FP32, we estimate that the Tesla A100 will be...

  • 165% faster than the RTX 2080
  • 162% faster than the GTX 1080 Ti
  • 129% faster than the Titan Xp
  • 93% faster than the RTX 2080 Ti
  • 86% faster than the Titan V
  • 67% faster than the Titan RTX
  • 40% faster than the Tesla V100

See the TensorFlow Deep Learning Benchmark table for more details.

This post covers three main topics:

  • The Tesla A100's estimated deep learning performance.
  • An overview of TensorFloat-32, NVIDIA's new floating point representation.
  • The DGX A100's estimated deep learning performance.

Tesla A100 vs Tesla V100

|                                   | Tesla A100    | Tesla V100    |
|-----------------------------------|---------------|---------------|
| FP32 Peak Theoretical Performance | 19.5 TFLOPS   | 15.7 TFLOPS   |
| FP32 Actual Performance (MatMul)  | ~18.1 TFLOPS¹ | 14.6 TFLOPS   |
| Die Size                          | 826 mm²       | 815 mm²       |
| Process Node                      | TSMC 7nm      | TSMC 12nm FFN |
| TDP                               | 400 W         | 300 W         |
| GFLOPS per W (actual MatMul)      | 45.2          | 48.6          |
| FP32 Speedup over V100 (MatMul)   | ~1.2x²        | 1x            |
| TensorFloat-32 Speedup            | ~1.16x³       | N/A           |
| Deep Learning Speedup over V100   | ~1.4x⁴        | 1x            |
  1. 18.1 TFLOPS is derived as follows: the V100's actual performance is ~93% of its peak theoretical performance (14.6 TFLOPS / 15.7 TFLOPS). If the Tesla A100 follows the same pattern, its actual performance will be ~18.1 TFLOPS (93% of 19.5 TFLOPS).
  2. 1.2x = 18.1 TFLOPS (est. of the A100's actual perf.) / 14.6 TFLOPS (the V100's actual perf.)
  3. The new TensorFloat-32 (TF32) format will boost speed by ~1.16x; this is the median speedup observed by Google when switching from FP32 to bfloat16. bfloat16 has the same 8-bit exponent as NVIDIA's TensorFloat-32 format. Sources: Google, NVIDIA.
  4. 1.4x ≈ 1.2x (FP32 speedup) × 1.16x (TensorFloat-32 speedup).
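
To make the chain of footnotes easy to follow, here is the same arithmetic as a short Python sketch. The inputs are the spec-sheet and measured numbers from the table above; everything else is derived, and the results round to the ~1.2x and ~1.4x figures quoted:

# Estimate chain from footnotes 1-4. Inputs come from the table above;
# the 1.16x TF32 boost is the assumed median FP32 -> bfloat16 speedup.
v100_peak_tflops = 15.7     # FP32 peak, V100
v100_matmul_tflops = 14.6   # FP32 MatMul, measured on a V100 (see Appendix)
a100_peak_tflops = 19.5     # FP32 peak, A100
tf32_speedup = 1.16

percent_of_peak = v100_matmul_tflops / v100_peak_tflops    # ~0.93
a100_matmul_est = a100_peak_tflops * percent_of_peak       # ~18.1 TFLOPS
fp32_speedup = a100_matmul_est / v100_matmul_tflops        # ~1.2x
dl_speedup = fp32_speedup * tf32_speedup                   # ~1.4x

print(f"Estimated A100 FP32 MatMul: {a100_matmul_est:.1f} TFLOPS")
print(f"Estimated FP32 speedup over V100: {fp32_speedup:.2f}x")
print(f"Estimated deep learning speedup: {dl_speedup:.2f}x")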

TensorFloat-"32" - a 19-bit!? representation

TensorFloat-32 (TF32) is a 19-bit floating point representation that's natively supported by the A100's tensor cores. A TF32 number looks like this:

  • 8-bit exponent (the same as standard FP32 and bfloat16)
  • 10-bit mantissa (the same as standard FP16)
  • 1-bit sign

No code changes required

Your TensorFlow/PyTorch code will still use FP32. However, within the tensor cores, these numbers are converted to TF32. Here's how it works:

  1. The tensor cores receive IEEE 754 FP32 numbers.
  2. The tensor cores convert the FP32 numbers to TF32 by reducing the mantissa to 10 bits.
  3. The multiply step is performed in TF32.
  4. The accumulate step is performed in standard FP32, producing an IEEE 754 FP32 tensor as output.
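
To make step 2 concrete, here is a minimal Python sketch that emulates the mantissa reduction on a single value by zeroing the 13 low mantissa bits of its FP32 encoding. This is an illustration of the number format only (it truncates where real hardware rounds), not how the tensor cores are actually programmed:

import struct

def to_tf32(x):
    # Keep the sign bit, the 8-bit exponent, and the top 10 mantissa bits of
    # an IEEE 754 FP32 value; zero the remaining 13 mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

a, b = 0.1, 3.14159
print(to_tf32(a) * to_tf32(b))  # product of TF32-rounded inputs
print(a * b)                    # full-precision product for comparison

In practice you never write code like this: the conversion happens inside the tensor cores, which is why existing FP32 TensorFlow/PyTorch code picks up the speedup without modification.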

A Move to 400 W Cards, the End of an Era

The release of the 400 watt A100 (and hints at a 450 watt version) marks the end of the 300 watt flagship GPU era. It's the logical continuation of the power increases we've already seen from the DGX-2H, which sported 16x 450 watt overclocked V100s. A cynical take is that we have reached the end of "simple scaling" silicon progress. Now, all that's left to do is turn up the temperature. Prepare for 50 kW racks and exotic data center cooling.

The DGX A100 Server: 8x A100s, 2x AMD EPYC CPUs, and PCIe Gen 4

In addition to the Ampere architecture and the A100 GPU, NVIDIA also announced the new DGX A100 server, the first generation of the DGX series to use AMD CPUs. One of the most important changes is PCIe Gen 4 support, provided by the AMD EPYC CPUs, which allows the use of Mellanox 200 Gbps HDR InfiniBand interconnects. Sixteen lanes of PCIe Gen 3 have a peak bandwidth of 16 GB/s, while 16 lanes of PCIe Gen 4 offer twice that: 32 GB/s.

This is important for three reasons:

  1. CPU-to-GPU communication is twice as fast as in previous generations of servers.
  2. PCIe Gen 4 enables the use of the new Mellanox 200 Gbps HDR InfiniBand interconnects. HDR InfiniBand (200 Gbps) needs up to 25 GB/s of host bandwidth. As you can see: 32 GB/s provided by PCIe Gen 4 > 25 GB/s required by Mellanox HDR InfiniBand > 16 GB/s provided by PCIe Gen 3.
  3. The new GPUDirect Storage feature, which allows GPUs to read directly from NVMe drives, will be able to support twice as many NVMe drives using PCIe Gen 4 when compared with PCIe Gen 3. For more info on GPUDirect Storage, see NVIDIA's post here: https://devblogs.nvidia.com/gpudirect-storage/.
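
The arithmetic behind point 2 is simple enough to spell out. The sketch below also includes PCIe's 128b/130b encoding overhead, which the rounded figures above ignore:

# Peak one-direction bandwidth of a x16 PCIe link, in GB/s.
# PCIe Gen 3 signals at 8 GT/s per lane, Gen 4 at 16 GT/s, both with
# 128b/130b encoding.
def pcie_x16_gb_per_s(gt_per_s_per_lane, lanes=16, encoding=128 / 130):
    return gt_per_s_per_lane * lanes * encoding / 8

pcie_gen3 = pcie_x16_gb_per_s(8)    # ~15.8 GB/s
pcie_gen4 = pcie_x16_gb_per_s(16)   # ~31.5 GB/s
hdr_infiniband = 200 / 8            # 200 Gbps HDR InfiniBand = 25 GB/s

print(f"PCIe Gen 3 x16: {pcie_gen3:.1f} GB/s")
print(f"PCIe Gen 4 x16: {pcie_gen4:.1f} GB/s")
print(f"HDR InfiniBand: {hdr_infiniband:.1f} GB/s")

# Gen 3 can't feed a 200 Gbps HDR link at line rate; Gen 4 can.
assert pcie_gen3 < hdr_infiniband < pcie_gen4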

Let's take a look at the DGX A100 server side by side with the DGX-1 and Lambda Hyperplane-8 V100.

The new NVIDIA DGX A100 server.
|                                               | DGX A100                | DGX-1                   | Lambda Hyperplane-8 V100 |
|-----------------------------------------------|-------------------------|-------------------------|--------------------------|
| Theoretical FP32 Performance                  | 156 TFLOPS              | 125.6 TFLOPS            | 125.6 TFLOPS             |
| Measured FP32 MatMul Perf.                    | TBD                     | 116.8 TFLOPS            | 116.8 TFLOPS             |
| Percent of Peak Performance                   | TBD                     | 93%                     | 93%                      |
| Approx. FP32 MatMul Perf. (using V100 % peak) | 144.8 TFLOPS            | 116.8 TFLOPS            | 116.8 TFLOPS             |
| PCIe Generation                               | 4                       | 3                       | 3                        |
| Interconnect Max Speed                        | 200 Gbps InfiniBand HDR | 100 Gbps InfiniBand EDR | 100 Gbps InfiniBand EDR  |
| Interconnect Count                            | 9                       | 4                       | 4                        |
| Total Interconnect Theoretical Throughput     | 1800 Gbps               | 400 Gbps                | 400 Gbps                 |
| TDP                                           | 6500 W                  | 3500 W                  | 3500 W                   |
| Rack Units                                    | 6U                      | 4U                      | 4U                       |
| MSRP                                          | $200,000                | $120,000                | $95,000                  |
| FP32 GFLOPS / $ (higher is better)            | 0.78                    | 0.97                    | 1.32                     |

The DGX A100 offers far higher node-to-node communication bandwidth than the DGX-1 or the Lambda Hyperplane-8 V100. Even though it is less efficient on a GFLOPS / $ basis, the DGX A100 may deliver better cluster scaling performance thanks to that massive increase in node-to-node bandwidth.

A Decline in Efficiency

The A100 GPU is the first recent release from NVIDIA to show an actual reduction in compute power efficiency (GFLOPS / watt). This is especially surprising given the improved fabrication process: the A100 jumps from TSMC's 12nm node down to TSMC's 7nm node. Despite the expected efficiency gain, the A100 is only 0.93x as efficient as the V100, measured as peak theoretical FP32 throughput per watt of TDP.

|                       | V100          | A100          | Change |
|-----------------------|---------------|---------------|--------|
| Theoretical Peak FP32 | 15,700 GFLOPS | 19,500 GFLOPS | 1.24x  |
| TDP                   | 300 W         | 400 W         | 1.33x  |
| GFLOPS / watt         | 52.3          | 48.8          | 0.93x  |
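
The table is just two divisions, but for completeness, here's the efficiency arithmetic written out:

# GFLOPS per watt of TDP, using peak theoretical FP32 throughput.
v100_gflops, v100_tdp = 15_700, 300
a100_gflops, a100_tdp = 19_500, 400

v100_eff = v100_gflops / v100_tdp    # ~52.3 GFLOPS / watt
a100_eff = a100_gflops / a100_tdp    # ~48.8 GFLOPS / watt
print(f"V100: {v100_eff:.1f} GFLOPS/W")
print(f"A100: {a100_eff:.1f} GFLOPS/W")
print(f"Change: {a100_eff / v100_eff:.2f}x")   # ~0.93x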

Conclusion

Like everybody else, we're excited about the new architecture and can't wait to put the new A100 GPUs through their paces. What we can say for now: power efficiency has declined, but we still expect a roughly 40% increase in deep learning performance once we get to test the cards ourselves.

Appendix

PyTorch FP32 MatMul Benchmark Script

Below is the script we used to calculate the FP32 MatMul performance of the V100 GPU. We'll run the same script on the first A100 GPU that we get access to.

import time

import torch

GPU = True        # run on the GPU (set to False for a CPU baseline)
INFINITY = False  # loop forever instead of running the suite once

# Preload square FP32 matrices of each size so that allocation and
# host-to-device transfer are excluded from the timed region.
print("Preloading matrices...", flush=True)
mats = {}
MATSZ = [100, 500, 1000, 2000, 4000, 8000, 16000, 32000]
for msz in MATSZ:
    mat = torch.rand(msz, msz)
    mats[msz] = mat.cuda() if GPU else mat
print("Done preloading.", flush=True)

def run_benchmark(msz):
    # An N x N by N x N matrix multiply performs roughly 2 * N^3 FLOPs.
    ops = 2 * msz ** 3
    mat = mats[msz]
    start = time.time()
    torch.mm(mat, mat)
    if GPU:
        # mm() is launched asynchronously on the GPU; wait for it to finish
        # so the wall-clock time covers the whole operation.
        torch.cuda.synchronize()
    duration = time.time() - start
    tflops = ops / duration / 10**12
    print("{}x{} MM {} ops in {} sec = TFLOPS {}"
          .format(msz, msz, ops, duration, tflops),
          flush=True)

while True:
    print("======== Results ======== ", flush=True)
    for msz in MATSZ:
        run_benchmark(msz)
    if not INFINITY:
        break

MatMul benchmark results (not quite an SGEMM, but close enough) from a V100 GPU:

$ python3 matmul.py
100x100 MM 2000000 ops in 0.36130833625793457 sec = TFLOPS 5.535438292716886e-06
500x500 MM 250000000 ops in 0.00015306472778320312 sec = TFLOPS 1.6332959501557631
1000x1000 MM 2000000000 ops in 0.0006077289581298828 sec = TFLOPS 3.290940761082777
2000x2000 MM 16000000000 ops in 0.0011909008026123047 sec = TFLOPS 13.435208008008008
4000x4000 MM 128000000000 ops in 0.009345531463623047 sec = TFLOPS 13.696385325781927
8000x8000 MM 1024000000000 ops in 0.06832146644592285 sec = TFLOPS 14.987968690784859
16000x16000 MM 8192000000000 ops in 0.5691955089569092 sec = TFLOPS 14.39224286047586
32000x32000 MM 65536000000000 ops in 4.465559244155884 sec = TFLOPS 14.675877402313615

Die Size Analysis & Silicon Economics

The A100 SXM4 module graphic from NVIDIA lets us calculate the die size (purple square above) and do some basic silicon economics.

Using public images and specifications from NVIDIA's A100 GPU announcement, plus some knowledge of optimal silicon die layout, we can calculate the approximate die dimensions of the new A100 chip:

Known die area: 826 mm²
Die size in pixels: 354 px × 446 px
Die aspect ratio: dar = a / b ≈ 354 / 446 ≈ 0.7937

Solving the two constraints for the physical dimensions a and b:

    dar = a / b
    a × b = 826
    ⇒ a = 826 / b
    ⇒ dar = 826 / b²
    ⇒ b = sqrt(826 / dar)
    ⇒ a = 826 / sqrt(826 / dar)

Plugging in dar ≈ 0.7937:

    b = sqrt(826 / 0.7937) ≈ 32.26 mm
    a = 826 / 32.26 ≈ 25.6 mm

We then lay a prototype die of roughly 25.6 mm × 32.26 mm across the known usable area of a 300 mm silicon wafer. It's clear that there are four dies (two on each side) that won't fit unless we shrink the short side down to 25.5 mm; keeping the area at ~826 mm² then stretches the long side to ~32.4 mm. This gives a rough estimate of a potential die size of 25.5 mm × 32.4 mm, a total area of 826.2 mm², and 64 dies per wafer.

Image created using CALY Technologies' die yield calculator.

We conclude that the size of the A100 is approximately 25.5 mm x 32.4 mm and that they can fit 64 dies on a single wafer.
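
As a sanity check, here's the same algebra in a few lines of Python, together with a textbook dies-per-wafer approximation. This is not the layout tool used for the image above, so it lands close to, but not exactly on, the 64-die figure:

import math

die_area = 826.0            # mm^2, from NVIDIA's published A100 specs
aspect_ratio = 354 / 446    # a / b, measured in pixels from the SXM4 module graphic

b = math.sqrt(die_area / aspect_ratio)   # long side, ~32.3 mm
a = die_area / b                         # short side, ~25.6 mm
print(f"Estimated die: {a:.1f} mm x {b:.1f} mm")

# Classic dies-per-wafer approximation for a 300 mm wafer; it ignores edge
# exclusion and scribe lines, hence the small gap vs. the 64-die layout above.
wafer_diameter = 300.0
dies_per_wafer = (math.pi * (wafer_diameter / 2) ** 2 / die_area
                  - math.pi * wafer_diameter / math.sqrt(2 * die_area))
print(f"~{dies_per_wafer:.0f} candidate dies per 300 mm wafer")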

FP32: # Images Processed Per Sec During TensorFlow Training (1 GPU)

| Model        | RTX 2080 Ti | RTX 2080 | Titan RTX | Titan V | V100 | Titan Xp | 1080 Ti |
|--------------|-------------|----------|-----------|---------|------|----------|---------|
| ResNet-50    | 294         | 213      | 330       | 300     | 405  | 236      | 209     |
| ResNet-152   | 110         | 83       | 129       | 107     | 155  | 90       | 81      |
| Inception v3 | 194         | 142      | 221       | 208     | 259  | 151      | 136     |
| Inception v4 | 79          | 56       | 96        | 77      | 112  | 63       | 58      |
| VGG16        | 170         | 122      | 195       | 195     | 240  | 154      | 134     |
| AlexNet      | 3627        | 2650     | 4046      | 3796    | 4782 | 3004     | 2762    |
| SSD300       | 149         | 111      | 169       | 156     | 200  | 123      | 108     |

Lambda Hyperplane A100 Server


We're now accepting pre-orders for our Hyperplane A100 GPU server, available in both 8x and 4x GPU configurations.
