This blog post describes how to set up a RunAI cluster on Lambda Cloud with one or more cloud instances.
The Lambda Deep Learning Blog
Recent Posts
After a period of closed beta, persistent storage for Lambda GPU Cloud is now available for all A6000 and V100 instances in an extended open beta period.
Published 04/19/2022 by Kathy Bui
This curriculum provides an overview of free online resources for learning about deep learning. It includes courses, books, and even important people to follow. If you only want to do one thing, do this: train an MNIST network with PyTorch (https://github.com/pytorch/examples/tree/master/mnist). Introductory CS231n:
Published 11/01/2021 by Stephen Balaban
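As a hedged illustration of that "one thing", here is a minimal MNIST training sketch in PyTorch. It is not the code from the linked pytorch/examples repo; the model, optimizer, and hyperparameters below are placeholder choices.

```python
# Minimal MNIST training sketch with PyTorch (illustrative model and
# hyperparameters, not the code from the linked pytorch/examples repo).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(),
                      nn.Linear(28 * 28, 128), nn.ReLU(),
                      nn.Linear(128, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(3):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```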
Full Video Tutorial. This tutorial shows you how to install Docker with GPU support on Ubuntu Linux. To get GPU passthrough to work, you'll need docker, nvidia-container-toolkit, Lambda Stack, and a Docker image with a GPU-accelerated library. 1) Install Lambda Stack: LAMBDA_REPO=$(mktemp) && \ wget -O${LAMBDA_REPO} https://lambdalabs.
Published 07/19/2021 by Stephen Balaban
This guide will walk you through how to load data from various sources onto your Lambda Cloud GPU instance. If you're looking for how to get started and SSH into your instance for the first time, check out our Getting Started Guide.
Published 05/03/2020 by Remy Guercio
This guide will walk you through the process of launching a Lambda Cloud GPU instance and using SSH to log in. For this guide we'll assume that you're running either Mac OS X or Linux. If you're a Windows user, we recommend using either...
Published 05/03/2020 by Remy Guercio
This tutorial will walk you through the steps required to set up a Mellanox SB7800 36-port switch. The subnet manager discovers and configures the devices running on the InfiniBand fabric. This tutorial will show you how to set it up via the command line or via the web browser.
Published 10/30/2019 by Stephen Balaban
This tutorial explains the basics of TensorFlow 2.0, using image classification as the example. 1) Building a data pipeline with the Dataset API. 2) Training, evaluating, saving, and restoring models with Keras. 3) Multi-GPU training with a distribution strategy. 4) Customized training with callbacks.
Published 10/01/2019 by Chuan Li
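For a rough sense of the pieces that tutorial covers, here is a minimal tf.keras sketch combining a tf.data pipeline, a distribution strategy for multi-GPU training, and saving/restoring a model. The toy model and the use of MNIST are stand-ins, not the tutorial's actual image-classification code.

```python
# Sketch of the four pieces: data pipeline, Keras training, multi-GPU
# strategy, and save/restore. Toy model; hyperparameters are illustrative.
import tensorflow as tf

# 1) Data pipeline with the Dataset API.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
ds = (tf.data.Dataset.from_tensor_slices((x_train[..., None] / 255.0, y_train))
      .shuffle(10_000)
      .batch(64)
      .prefetch(tf.data.AUTOTUNE))

# 3) Multi-GPU training with a distribution strategy.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# 2) Train, then save and restore the model with Keras.
model.fit(ds, epochs=2)
model.save("model.h5")  # HDF5 here; the SavedModel format also works in TF 2.x
restored = tf.keras.models.load_model("model.h5")

# 4) Customized training behavior is added via callbacks passed to fit(),
#    e.g. tf.keras.callbacks.EarlyStopping or ModelCheckpoint.
```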
This blog will walk you through the steps of setting up a Horovod (https://github.com/horovod/horovod) + Keras (https://keras.io/) environment for multi-GPU training. Prerequisites: * Hardware: a machine with at least two GPUs. * Basic software: Ubuntu (18.04 or 16.04), NVIDIA driver (418.43), CUDA (10.0).
Published 08/28/2019 by Chuan Li
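To give a flavor of that setup, here is a minimal Horovod + Keras data-parallel sketch. It uses the horovod.tensorflow.keras API with tf.keras, whereas the original post targets an older Keras/TensorFlow stack, so treat it as an illustration rather than the post's exact recipe. It would be launched with something like `horovodrun -np 2 python train_hvd.py`.

```python
# Minimal Horovod + Keras data-parallel sketch (launch with horovodrun).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

(x, y), _ = tf.keras.datasets.mnist.load_data()
ds = (tf.data.Dataset.from_tensor_slices((x[..., None] / 255.0, y))
      .shuffle(10_000)
      .batch(64))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(ds, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```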
One of the most frequently asked questions we get at Lambda Labs is, “how do I track resource utilization for deep learning jobs?” Resource utilization tracking can help machine learning engineers improve both their software pipeline and model performance. I recently came across a tool called "Weights and Biases [https://www.
Published 08/12/2019 by Chuan Li
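As a small hedged sketch of the tool mentioned above: once a Weights & Biases run is initialized, system metrics such as GPU utilization are recorded in the background, and training metrics are logged with wandb.log. The project name and loss values below are placeholders.

```python
# Minimal Weights & Biases logging sketch. wandb records GPU/CPU utilization
# automatically for the lifetime of the run; wandb.log adds custom metrics.
import wandb

run = wandb.init(project="utilization-demo")   # placeholder project name

for step in range(100):
    loss = 1.0 / (step + 1)                    # stand-in for a real training loss
    wandb.log({"loss": loss}, step=step)

run.finish()
```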
Distributed training allows deep learning tasks to scale up, so larger models can be trained on more extensive data. In this tutorial, we will explain how to do distributed training across multiple nodes.
Published 06/07/2019 by Chuan Li
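The excerpt doesn't say which framework the post uses for multi-node training; one common approach is tf.distribute.MultiWorkerMirroredStrategy configured through the TF_CONFIG environment variable, sketched below with placeholder host addresses and a toy model. Every node runs the same script with its own task index.

```python
# Multi-node data-parallel sketch with MultiWorkerMirroredStrategy.
# Host addresses are placeholders; each node sets its own task index
# (worker 0 acts as the chief) before starting the script.
import json
import os

import numpy as np
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},  # 0 on the first node, 1 on the second
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="adam", loss="mse")

# The global batch is split across all workers in the cluster.
x = np.random.rand(1024, 8).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=64)
```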
This tutorial explains how early stopping is implemented in TensorFlow. The key is the tf.keras.callbacks.EarlyStopping callback. Early stopping is triggered when a monitored quantity (such as validation loss) has stopped improving over a set number of epochs.
Published 06/06/2019 by Chuan Li
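A minimal sketch of that callback in use, with a toy model and random data; the monitored quantity and patience value are illustrative.

```python
# EarlyStopping: stop training when the monitored quantity (validation loss
# here) has not improved for `patience` consecutive epochs.
import numpy as np
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # quantity to watch
    patience=5,                  # epochs without improvement before stopping
    restore_best_weights=True,   # roll back to the best-performing weights
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

x, y = np.random.rand(256, 4), np.random.rand(256, 1)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```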
This tutorial explains how to use checkpoints to save and restore TensorFlow models during training. The key is the tf.keras.callbacks.ModelCheckpoint callback, which saves the model; set initial_epoch in the model.fit call to resume training from a pre-saved checkpoint.
Published 06/06/2019 by Chuan Li
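A minimal sketch of that save-and-resume pattern, with a toy model and placeholder file paths:

```python
# ModelCheckpoint: save weights every epoch, then resume by reloading the
# latest checkpoint and passing initial_epoch to model.fit.
import os

import numpy as np
import tensorflow as tf

os.makedirs("ckpt", exist_ok=True)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(256, 4), np.random.rand(256, 1)

ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="ckpt/weights.{epoch:02d}.h5",  # one weights file per epoch
    save_weights_only=True,
)
model.fit(x, y, epochs=5, callbacks=[ckpt])

# Resume after epoch 5: restore the saved weights and tell fit where to start.
model.load_weights("ckpt/weights.05.h5")
model.fit(x, y, initial_epoch=5, epochs=10, callbacks=[ckpt])
```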