Setting up Horovod + Keras for Multi-GPU training

This blog will walk you through the steps of setting up a Horovod + Keras environment for multi-GPU training.


  • Hardware: A machine with at least two GPUs
  • Basic Software: Ubuntu (18.04 or 16.04), Nvidia Driver (418.43), CUDA (10.0) and CUDNN (7.5.0). All of these can be easily installed using Lambda Stack for free.


There are several of things to be installed on top of the necessary software stack: NCCL2, Open MPI (optional), and Horovod. Getting the installation right can be a bit tedious for first time users, which motivated us to write this step-by-step guide to help people get it right.

You can find the one-stop installation script for all the steps after the download of the NCCL2 library.


The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are optimized for NVIDIA GPUs. Think it as a library used by Horovod to improve the performance of all-gather, reduce, broadcast, etc. across GPUs devices.

You need an NVIDIA Developer Program account to download the library. The registeration is free at this link. Once login, download the library with the version that suits your CUDA environment. In our case, it is the NCCL v2.4.8, for CUDA 10.0, July 31, 2019, O/S agnostic local installer.

To install the library, one needs to copy the files to specific locations and add their path to LD_LIBRARY_PATH:

tar -vxf ~/Downloads/nccl_2.4.8-1+cuda10.0_x86_64.txz -C ~/Downloads/
sudo cp ~/Downloads/nccl_2.4.8-1+cuda10.0_x86_64/lib/libnccl* /usr/lib/x86_64-linux-gnu/
sudo cp ~/Downloads/nccl_2.4.8-1+cuda10.0_x86_64/include/nccl.h  /usr/include/
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu' >> ~/.bashrc
source ~/.bashrc

Open MPI (Optional)

This is not a must-do. However, Horovod has a "Open MPI-based wrapper", and it is very convenient to use them together. It is a bit like the relationship between Keras and TensorFlow – you don't have to use TensorFlow, because Keras also supports Theano and CNTK. Nonetheless, in practice most Keras users do install TensorFlow as the backend.

In the meantime, Open MPI can be used in conjunction with NCCL: It is easy to use MPI for CPU-to-CPU communication and NCCL for GPU-to-GPU communication. We won't go into too much details here. A more in-depth discussion about the relationship between NCCL and MPI can be found here.

Before we install Open MPI, there is a catch: By default, Ubuntu comes with an outdated version mpirun and no mpicxx (the C++ compiler for Open MPI). This default setup is insufficient to work with Horovod, and can potentially cause a problem for your later installation. Hence the first thing you should do is removing these pre-installed files:

sudo mv /usr/bin/mpirun /usr/bin/bk_mpirun
sudo mv /usr/bin/mpirun.openmpi /usr/bin/bk_mpirun.openmpi

Then we can install Open MPI with following steps:

wget -P ~/Downloads
tar -xvf ~/Downloads/openmpi-4.0.1.tar.gz -C ~/Downloads
cd ~/Downloads/openmpi-4.0.1
./configure --prefix=$HOME/openmpi
make -j 8 all
make install

echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/openmpi/lib' >> ~/.bashrc
echo 'export PATH=$PATH:~/openmpi/bin' >> ~/.bashrc
source ~/.bashrc


Finally, we install Horovod, Keras, and TensorFlow-GPU in a Python3 virtual environment. g++-4.8 is also needed for Horovod to work with the pip installed TensorFlow.


# Install g++-4.8 (for running horovod with TensorFlow)
sudo apt install g++-4.8

# Create a Python3.6 virtual environment
sudo apt-get install python3-pip
sudo pip3 install virtualenv 
virtualenv -p /usr/bin/python3.6 venv-horovod-keras
. venv-horovod-keras/bin/activate

# Install keras and TensorFlow GPU backend
pip install tensorflow-gpu==1.13.2 keras


Multi-GPU Training

To test the environment, we run the keras MNIST training example from the Horovod official Repo.

git clone
cd horovod/examples
. venv-horovod-keras/bin/activate

There are actually two ways to run a Horovod job: using the horovodrun wrapper, or using the mpirun API. Below are the examples of running a training script with two GPUs on the same machine:

horovodrun -np 2 -H localhost:2 --mpi python

mpirun -np 2 \
    -H localhost:2 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \

These two commands do basically the same thing, whereas the mpirun method allows more configurations. For example, the NCCL_DEBUG=INFO option allows the display of NCCL devices information for the job.


This tutorial demonstrated how to setup a working environment for multi-GPU training with Horovod and Keras. You can find the one-stop installation script for all the steps after the download of the NCCL2 library.

!-- Intercom -->