# Tesla A100 Server Total Cost of Ownership Analysis

This post compares the Total Cost of Ownership (TCO) of Lambda servers and clusters with that of cloud instances equipped with NVIDIA A100 GPUs. We first calculate the TCO for individual Lambda Hyperplane-A100 and Scalar-A100 servers and compare it with the cost of renting a similarly equipped AWS EC2 p4d.24xlarge instance. We then walk through the cost of building and operating server clusters using NVIDIA A100 GPUs.

## Lambda A100 servers vs. AWS p4d.24xlarge

Two models of Lambda servers are available with NVIDIA Tesla A100 GPUs: Hyperplane-A100 and Scalar-A100. The main difference is how the A100 GPUs are interconnected in each server. The A100 GPUs in the Hyperplane-A100 are connected via NVLink + NVSwitch, whereas the GPUs in the Scalar-A100 are connected via PCIe. For this reason, the Hyperplane-A100 server scales better when multi-GPU training becomes communication-bottlenecked. More importantly, Hyperplane-A100 servers support InfiniBand (8x IB ports per server) and GPUDirect Remote Direct Memory Access (RDMA), enabling them to scale efficiently for multi-node distributed training. In addition, the A100-PCIe GPU used in the Scalar server draws less power than the A100-SXM4 GPU used in the Hyperplane servers (250 W vs. 400 W), so Scalar servers may experience a 10% throughput drop while performing machine learning tasks.

On the cloud side, the AWS EC2 p4d.24xlarge instance is similar to the eight-GPU Lambda Hyperplane 8-A100. The main difference is the choice of CPU: the Hyperplane-A100 uses the latest AMD EPYC CPUs rather than Intel Xeon CPUs because of their higher core count and PCIe 4.0 support.

#### Test Configurations

| | Hyperplane-A100 | Scalar-A100 | p4d.24xlarge |
| --- | --- | --- | --- |
| GPU | 8x A100-SXM4-40GB | 8x A100-PCIe-40GB | 8x A100-SXM4-40GB |
| CPU | 2x AMD EPYC 7763 (64-Core) | 2x AMD EPYC 7763 (64-Core) | Intel(R) Xeon(R) Platinum 8275CL (24-Core) |
| System Memory | 1.0 TB ECC | 1.0 TB ECC | 1.1 TB ECC |
| Storage | 15.36 TB NVMe | 15.36 TB NVMe | 8 TB NVMe |

## Takeaway

We will give a walkthrough of the performance and cost analysis in later sections, but first, we need to discuss the key metric for measuring cost-effectiveness: flops/$ or flops per dollar. Flops/$ is the amount of computation you can buy with one dollar. It is calculated by dividing the total petaflops generated over a time period by the total cost of purchasing and operating the system over that same time period.

A higher flops/$ means a more cost-effective system. As the following two tables show, on-prem servers give significantly better flops/$ than cloud instances, as long as one does not write the value of the servers down to zero after a year (100% annual depreciation).

Note that these two tables assume 50% occupancy for the on-prem and reserved cloud instances. We will have a break-even occupancy analysis later in this blog.
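The flops/$ definition above can be sketched in a few lines of Python. This is a minimal illustration, not the exact calculation behind our tables: the function name, its arguments, and the workload and cost inputs are placeholder assumptions. Only the $10,534 cost increase for moving from 50% to 100% occupancy is taken from footnote 5.

```python
def flops_per_dollar(work_at_full_occupancy, occupancy, total_cost):
    """Return computation per dollar for one time period.

    work_at_full_occupancy: petaflops the system would generate over the
        period if it were busy 100% of the time (hypothetical input).
    occupancy: fraction of the period the system is actually busy (0 to 1).
    total_cost: cost of purchasing and operating the system over the period.
    """
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be between 0 and 1")
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return work_at_full_occupancy * occupancy / total_cost


# Placeholder inputs for illustration only, NOT measured values.
full_work = 10_000_000      # petaflops over one year at 100% occupancy
cost_at_half = 100_000      # one-year cost in dollars at 50% occupancy
extra_power_cost = 10_534   # added electricity cost at 100% (footnote 5)

half = flops_per_dollar(full_work, 0.5, cost_at_half)
full = flops_per_dollar(full_work, 1.0, cost_at_half + extra_power_cost)
print(f"50% occupancy:  {half:.1f} petaflops/$")
print(f"100% occupancy: {full:.1f} petaflops/$")
```

Because doubling occupancy doubles the computation while adding only the extra electricity to the cost, flops/$ rises by somewhat less than a factor of two.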

## Break-even Occupancy Analysis

#### Hyperplane-A100 Cluster with InfiniBand Networking

| | 1x Cluster | 8x Cluster | 16x Cluster |
| --- | --- | --- | --- |
| Upfront | $233,422 | $1,610,948 | $3,137,649 |
| Annual Total Operating Costs | $23,894 | $111,804 | $209,723 |
| Annual System Administration Cost | $10,000 | $20,000 | $40,000 |
| Annual Co-location Cost | $13,893 | $101,804 | $199,722 |
| Total Cost at Year 1 | $257,316 | $1,722,752 | $3,347,372 |
| Total Cost at Year 3 | $305,104 | $1,946,360 | $3,766,818 |
| Total Cost at Year 5 | $352,892 | $2,169,968 | $4,186,264 |
| Number of nodes | 1 | 8 | 16 |
| Amortized Cost / Year / Node (5 Years of Use) | $70,578 | $54,249 | $52,328 |

Similarly, the following table shows the TCO of Scalar-A100 clusters (without InfiniBand networking, and assuming distributed training is not the use case).

#### Scalar-A100 Cluster

| | 1x Cluster | 8x Cluster | 16x Cluster |
| --- | --- | --- | --- |
| Upfront | $161,757 | $1,028,964 | $2,025,382 |
| Annual Total Operating Costs | $16,779 | $59,084 | $107,432 |
| Annual System Administration Cost | $10,000 | $20,000 | $40,000 |
| Annual Co-location Cost | $6,779 | $49,084 | $97,432 |
| Total Cost at Year 1 | $178,536 | $1,088,048 | $2,132,814 |
| Total Cost at Year 3 | $212,094 | $1,206,216 | $2,347,678 |
| Total Cost at Year 5 | $245,652 | $1,324,384 | $2,562,542 |
| Number of nodes | 1 | 8 | 16 |
| Amortized Cost / Year / Node (5 Years of Use) | $49,130 | $33,109 | $32,031 |

The above tables compare the Hyperplane-A100 TCO and the Scalar-A100 TCO. Although Scalar-A100 clusters come at a lower upfront and operating cost, which type of A100 server to use depends on the use case. We recommend Hyperplane-A100 for clusters that run distributed training across multiple nodes, because the 8 InfiniBand ports in the Hyperplane server provide superior inter-node communication. On the other hand, if the main use case of the cluster is inference or single-node training, the Scalar-A100 server can be a more cost-effective choice.
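The totals in the two cluster tables follow a simple pattern: total cost at year N is the upfront cost plus N times the annual operating cost, and the amortized figure divides the 5-year total by years and nodes. The sketch below reproduces those rows from the upfront and annual operating figures. The function names are ours for illustration, and truncating to whole dollars is an assumption that happens to match the published numbers.

```python
def total_cost(upfront, annual_opex, years):
    """Upfront cost plus operating costs accumulated over `years`."""
    return upfront + annual_opex * years


def amortized_cost_per_node_year(upfront, annual_opex, nodes, years=5):
    """Total cost spread evenly over nodes and years, in whole dollars.

    Integer division (truncation) is an assumption chosen to match the
    figures in the tables.
    """
    return total_cost(upfront, annual_opex, years) // (years * nodes)


# (upfront, annual total operating cost) per cluster size, from the tables.
clusters = {
    "Hyperplane-A100": {
        1: (233_422, 23_894),
        8: (1_610_948, 111_804),
        16: (3_137_649, 209_723),
    },
    "Scalar-A100": {
        1: (161_757, 16_779),
        8: (1_028_964, 59_084),
        16: (2_025_382, 107_432),
    },
}

for model, sizes in clusters.items():
    for nodes, (upfront, opex) in sizes.items():
        year5 = total_cost(upfront, opex, 5)
        per_node = amortized_cost_per_node_year(upfront, opex, nodes)
        print(f"{model} {nodes}x: year-5 total ${year5:,}, "
              f"${per_node:,} per node per year")
```

The per-node figure falls as the cluster grows because fixed costs such as system administration are shared across more nodes.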

## Performance

Last but not least, we benchmark these servers using various deep learning models and compare their training throughput in the figures below. All benchmarks used the NGC PyTorch container and can be reproduced using this repo.

As expected, the Lambda Hyperplane-A100 server and the AWS p4d.24xlarge instance deliver similar performance (the average difference is less than 1%) since they use similar hardware configurations. The Scalar-A100 is about 7.2% and 8.2% slower in TF32 and AMP (Automatic Mixed Precision), respectively, due to its slower GPU interconnect and lower power consumption.

## Footnotes

1. We think 50% is a fair estimate for the Power Utilization Ratio, considering that the system does not draw full power most of the time. This is due to factors such as job scheduling, I/O or device-to-device communication bottlenecks, and suboptimal code.
2. $10,000 per server is generous considering the average salary for a full-time data center system administrator is $64,892.
3. We assume a simple linear model and depreciate the value of the server by 25% of its purchase price per year.
4. The cheapest 1-yr reserved plan (all upfront) on AWS is more expensive than the TCO of a Lambda Hyperplane-A100 with a 100% annual depreciation rate. This means you can spend less money to buy and operate a Lambda Hyperplane-A100 server for a year and keep the server for free afterwards.
5. Increasing the occupancy rate from 50% to 100% will double the total petaflops over a year, but only increase the 1-yr TCO by $10,534 (doubling the electric bill). As a result, the petaflops/$ will increase to 74.5.
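
The linear depreciation model in footnotes 3 and 4 can be written out as a short Python sketch. The function names and the example inputs below are illustrative assumptions, not the exact model behind our tables: the cost of owning a server for some years is taken as the value lost to depreciation plus the operating costs paid over that time.

```python
def residual_value(purchase_price, years, annual_rate=0.25):
    """Value left after linear depreciation, never below zero.

    annual_rate is the fraction of the purchase price written off per year
    (0.25 in footnote 3, 1.0 for the 100% case in footnote 4).
    """
    return purchase_price * max(0.0, 1.0 - annual_rate * years)


def net_cost(purchase_price, annual_opex, years, annual_rate=0.25):
    """Depreciation incurred so far plus accumulated operating costs."""
    lost_value = purchase_price - residual_value(purchase_price, years, annual_rate)
    return lost_value + annual_opex * years


# Hypothetical server: placeholder price and operating cost, NOT quoted figures.
price, opex = 100_000, 10_000
print(net_cost(price, opex, years=1))                   # 25% depreciation
print(net_cost(price, opex, years=1, annual_rate=1.0))  # written off in a year
```

With a 100% rate the one-year cost equals the full purchase price plus one year of operation, which is the worst case that footnote 4 compares against the reserved cloud plan.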