Reproduce imagenet18 with a Titan RTX server

Last year, fast.ai won the first ImageNet training cost challenge as part of the DAWN benchmark. Their customized ResNet50 takes 3.27 hours to reach 93% Top-5 accuracy on an AWS p3.16xlarge (8 x V100 GPUs). This year, fast.ai teamed up with DIUx and cut the training time down to 18 minutes with a cluster of sixteen p3.16xlarge machines. This was the fastest solution at the time it was announced (Sep 2018).

In this blog, we reproduce this latest ImageNet result on a single server with eight Turing GPUs (Titan RTX). It takes 2.36 hours to reach 93% Top-5 accuracy.

Here are the per-epoch statistics:

Epoch | Cumulative Training Time (hours) | Top-1 Acc (%) | Top-5 Acc (%)
1 0.0539 7.2800 19.1979
2 0.0921 18.3619 39.6699
3 0.1306 26.1700 50.2779
4 0.1691 29.9260 54.5460
5 0.2078 32.3339 58.0260
6 0.2465 27.2560 50.7680
7 0.2852 30.0799 54.3160
8 0.3240 39.1959 65.4260
9 0.3627 42.8860 69.2040
10 0.4014 45.1940 70.9100
11 0.4402 49.3839 74.9639
12 0.4788 54.9459 79.5660
13 0.5174 58.5820 81.8980
14 0.6433 57.3959 81.5960
15 0.7569 53.0480 77.7799
16 0.8703 58.9599 82.7979
17 0.9845 60.4039 83.8259
18 1.0982 62.3779 84.8720
19 1.2124 64.9540 86.6080
20 1.3258 65.9520 87.2919
21 1.4390 68.3700 88.7060
22 1.5529 71.4420 90.4820
23 1.6673 72.0479 90.6679
24 1.7826 72.8679 91.1559
25 1.8957 73.5739 91.4960
26 2.1455 75.8519 92.9879
27 2.3657 75.9179 93.0179
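As a quick sanity check on the table, the sketch below parses the rows (copied verbatim from above) and reports the first epoch that reaches the 93% Top-5 target, plus the time spent in a few individual epochs. The helper names are illustrative only and are not part of the imagenet18 code.

```python
# Rows copied from the table above: epoch, cumulative hours, Top-1 %, Top-5 %.
TABLE = """
1 0.0539 7.2800 19.1979
2 0.0921 18.3619 39.6699
3 0.1306 26.1700 50.2779
4 0.1691 29.9260 54.5460
5 0.2078 32.3339 58.0260
6 0.2465 27.2560 50.7680
7 0.2852 30.0799 54.3160
8 0.3240 39.1959 65.4260
9 0.3627 42.8860 69.2040
10 0.4014 45.1940 70.9100
11 0.4402 49.3839 74.9639
12 0.4788 54.9459 79.5660
13 0.5174 58.5820 81.8980
14 0.6433 57.3959 81.5960
15 0.7569 53.0480 77.7799
16 0.8703 58.9599 82.7979
17 0.9845 60.4039 83.8259
18 1.0982 62.3779 84.8720
19 1.2124 64.9540 86.6080
20 1.3258 65.9520 87.2919
21 1.4390 68.3700 88.7060
22 1.5529 71.4420 90.4820
23 1.6673 72.0479 90.6679
24 1.7826 72.8679 91.1559
25 1.8957 73.5739 91.4960
26 2.1455 75.8519 92.9879
27 2.3657 75.9179 93.0179
"""


def parse_table(text):
    """Turn the whitespace-separated rows into (epoch, hours, top1, top5) tuples."""
    rows = []
    for line in text.strip().splitlines():
        epoch, hours, top1, top5 = line.split()
        rows.append((int(epoch), float(hours), float(top1), float(top5)))
    return rows


def first_epoch_reaching(rows, top5_target):
    """Return (epoch, cumulative hours) of the first row meeting the Top-5 target."""
    for epoch, hours, _top1, top5 in rows:
        if top5 >= top5_target:
            return epoch, hours
    return None


def hours_per_epoch(rows):
    """Difference the cumulative time column to get the cost of each epoch."""
    durations = {}
    previous = 0.0
    for epoch, hours, _top1, _top5 in rows:
        durations[epoch] = hours - previous
        previous = hours
    return durations


rows = parse_table(TABLE)
durations = hours_per_epoch(rows)
print("first epoch at 93% Top-5:", first_epoch_reaching(rows, 93.0))
for epoch in (13, 14, 25, 26):
    print("epoch", epoch, "took", round(durations[epoch], 4), "hours")
```

The per-epoch cost steps up at epochs 14 and 26, which lines up with the increases in input resolution in the training schedule.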

You can jump to the code and the instructions from here.


Jeremy Howard wrote a brilliant blog post about the technical details behind their approach. Our takeaways are:

Inference with dynamic-size images: A key idea from the latest entry is to reduce the information loss in the preprocessing stage of inference. Let's first look at common practice when running inference for image classification: because the network uses fully connected layers, images are first transformed to a fixed size before being processed by the network. This usually involves either distorting the aspect ratio (via image resizing) or losing a significant part of the image content (via cropping). In the second case, people often use multiple random crops to increase the accuracy.

It is clear that such preprocessing will have a big impact on the network's performance: the network needs to be robust enough to recognize "distorted" or "cropped" objects, which often requires more epochs in training.

The key observation made by the team is that unnecessary preprocessing can be removed by getting rid of the fixed-size restriction. This is achieved by replacing the fully connected layers with global pooling layers, which place no restriction on the size of the input. Inference now becomes an "easier" task because it deals with input images that have been neither severely distorted nor cropped. As a consequence, fewer training epochs are required to reach a given test accuracy. According to the authors, doing so reduced the training time needed to reach the target accuracy by 23%.
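To make the size argument concrete, here is a minimal pure-Python sketch of global average pooling. It is not the team's implementation: `make_feature_maps` is a hypothetical stand-in for the output of a convolutional backbone, and the shapes are made up for illustration. In PyTorch, the same behavior is provided by `nn.AdaptiveAvgPool2d(1)`.

```python
def global_avg_pool(feature_maps):
    """Average each channel's H x W grid down to a single number.

    feature_maps is a list of channels, each a list of rows of floats.
    The output length equals the number of channels, whatever H and W are.
    """
    pooled = []
    for channel in feature_maps:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


def flatten(feature_maps):
    """Flatten all channels into one vector, as a fixed-size dense layer would need."""
    return [value for channel in feature_maps for row in channel for value in row]


def make_feature_maps(channels, height, width):
    """Hypothetical stand-in for backbone output (values are arbitrary)."""
    return [
        [[float(c + y + x) for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


small = make_feature_maps(channels=4, height=4, width=4)  # e.g. from a small square input
large = make_feature_maps(channels=4, height=9, width=7)  # e.g. from a larger, non-square input

# Flattening gives vectors of different lengths, so one dense layer cannot accept both.
print(len(flatten(small)), len(flatten(large)))

# Global pooling gives one value per channel in both cases.
print(len(global_avg_pool(small)), len(global_avg_pool(large)))
```

Because the pooled vector always has one entry per channel, the classifier that follows sees the same input width for every image size, which is what lets the same weights be used at several resolutions.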

Progressive training: Another interesting technique adopted by the team is progressive training with images of multiple resolutions. Training starts with low-resolution input images (128 x 128) and a larger batch size to quickly reach a reasonable accuracy; it then increases the resolution (first to 224 x 224 and then to 288 x 288) for the more expensive fine-tuning. This reduces the overall number of epochs needed to reach the target test accuracy. Note that this is only possible because the fully connected layers were replaced with global pooling layers, so a network trained on low-resolution images can work with higher-resolution images without modification. Meanwhile, the batch size and learning rate are carefully scheduled for each resolution to reach the desired performance as quickly as possible.
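The schedule can be read straight off the `--phases` argument used in the training command later in this post. The sketch below parses that string and looks up the image size, batch size, and learning rate at a given epoch. The helper names are hypothetical, and two things are assumptions rather than facts from the training code: that epochs are zero-based, and that the learning rate moves linearly between the two endpoints of each `lr` range.

```python
import ast

# The --phases string from the training command in this post.
PHASES = (
    "[{'ep': 0, 'sz': 128, 'bs': 512, 'trndir': '-sz/160'}, "
    "{'ep': (0, 7), 'lr': (1.0, 2.0)}, "
    "{'ep': (7, 13), 'lr': (2.0, 0.25)}, "
    "{'ep': 13, 'sz': 224, 'bs': 224, 'trndir': '-sz/320', 'min_scale': 0.087}, "
    "{'ep': (13, 22), 'lr': (0.4375, 0.043750000000000004)}, "
    "{'ep': (22, 25), 'lr': (0.043750000000000004, 0.004375)}, "
    "{'ep': 25, 'sz': 288, 'bs': 128, 'min_scale': 0.5, 'rect_val': True}, "
    "{'ep': (25, 28), 'lr': (0.0025, 0.00025)}]"
)

# literal_eval safely parses Python literals (dicts, tuples, numbers, booleans).
phases = ast.literal_eval(PHASES)


def data_config_at(phases, epoch):
    """Return the image size and batch size in effect at `epoch`."""
    config = {}
    for phase in phases:
        # Data phases carry 'sz' and a single starting epoch.
        if "sz" in phase and phase["ep"] <= epoch:
            config = {"sz": phase["sz"], "bs": phase["bs"]}
    return config


def lr_at(phases, epoch):
    """Return the learning rate at a (possibly fractional) epoch.

    Assumption: linear interpolation between the endpoints of each range.
    """
    for phase in phases:
        if "lr" in phase:
            start, end = phase["ep"]
            if start <= epoch < end:
                lr_start, lr_end = phase["lr"]
                fraction = (epoch - start) / (end - start)
                return lr_start + (lr_end - lr_start) * fraction
    return None  # epoch is outside every scheduled range


for epoch in (0, 13, 25):
    print(epoch, data_config_at(phases, epoch), lr_at(phases, epoch))
```

Read this way, the schedule has three data phases (128, 224, and 288 pixels) and the batch size shrinks as the resolution grows, matching the description above.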


In this post, we reproduced the current state-of-the-art ImageNet training performance on a single Turing GPU server. It is exciting to see that only about 2.4 hours are required to reach 93% Top-5 accuracy. Next time, we will reproduce the training in a multi-node distributed fashion within a local network.


You can reproduce the results with this repo.

First, clone the repo and set up a Python 3 virtual environment:

git clone

cd imagenet18

virtualenv -p python3 env
source env/bin/activate

pip install -r requirements_local.txt

Then download the data to your local machine (be aware that the tar files are about 200 GB in total):



tar -xvf imagenet-data-sorted.tar -C /mnt/data/data

tar -xvf imagenet-sz.tar -C /mnt/data/data

Finally, run the following command to reproduce the results on an 8-GPU server. Set "nproc_per_node" to match the number of GPUs on your machine.

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 \
training/ /mnt/data/data/imagenet \
--fp16 --logdir ./ncluster/runs/lambda-blade --distributed --init-bn0 --no-bn-wd \
--phases "[{'ep': 0, 'sz': 128, 'bs': 512, 'trndir': '-sz/160'}, {'ep': (0, 7), 'lr': (1.0, 2.0)}, {'ep': (7, 13), 'lr': (2.0, 0.25)}, {'ep': 13, 'sz': 224, 'bs': 224, 'trndir': '-sz/320', 'min_scale': 0.087}, {'ep': (13, 22), 'lr': (0.4375, 0.043750000000000004)}, {'ep': (22, 25), 'lr': (0.043750000000000004, 0.004375)}, {'ep': 25, 'sz': 288, 'bs': 128, 'min_scale': 0.5, 'rect_val': True}, {'ep': (25, 28), 'lr': (0.0025, 0.00025)}]"

To print out the statistics, locate the events.out file in the "logdir" folder and run this command:

python dawn/ \