The Lambda Deep Learning Blog

Recent Posts

RTX A6000 vs RTX 3090 Deep Learning Benchmarks

PyTorch benchmarks of the RTX A6000 and RTX 3090 for convnets and language models, covering both 32-bit and mixed-precision performance.

Published 08/09/2021 by Chuan Li
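
As a hedged illustration of what the mixed-precision numbers measure, below is the standard torch.cuda.amp training-step recipe in PyTorch; the tiny model, optimizer, and random data are placeholders, not the benchmark workloads.

```python
import torch

# Toy stand-ins for the benchmark workloads.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

inputs = torch.randn(64, 512).cuda()
targets = torch.randint(0, 10, (64,)).cuda()

optimizer.zero_grad()
with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()      # backward pass on the scaled loss
scaler.step(optimizer)             # unscales gradients, then applies the update
scaler.update()                    # adjusts the scale factor for the next step
```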

OpenAI's GPT-3 Language Model: A Technical Overview

Chuan Li, PhD reviews GPT-3, the new NLP model from OpenAI. The technical overview covers how GPT-3 was trained, GPT-2 vs. GPT-3, and GPT-3 performance.

Published 06/03/2020 by Chuan Li

Training Neural Networks in Record Time with the Hyperplane-16

Scaling out deep learning infrastructure becomes easier with 16 NVIDIA Tesla V100 GPUs and preinstalled frameworks like TensorFlow, Keras, and PyTorch.

Published 12/19/2019 by Chuan Li

TensorFlow 2.0 Tutorial 01: Basic Image Classification

This tutorial explains the basics of TensorFlow 2.0, using image classification as the example. 1) Data pipeline with the tf.data Dataset API. 2) Train, evaluate, save, and restore models with Keras. 3) Multi-GPU training with a distribution strategy. 4) Customized training with callbacks.

Published 10/01/2019 by Chuan Li
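
A minimal sketch of the four ingredients the tutorial covers, assuming random tensors in place of the tutorial's real image dataset:

```python
import tensorflow as tf

# Random tensors stand in for the tutorial's real image dataset.
images = tf.random.normal([256, 32, 32, 3])
labels = tf.random.uniform([256], maxval=10, dtype=tf.int32)

# 1) Data pipeline with the tf.data Dataset API.
ds = tf.data.Dataset.from_tensor_slices((images, labels)).shuffle(256).batch(32)

# 3) Multi-GPU training with a distribution strategy (uses all visible GPUs).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # 2) Build, train, evaluate, save, and restore models with Keras.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# 4) Customized training with callbacks could be passed to fit() here.
model.fit(ds, epochs=2)
```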

Setting up Horovod + Keras for Multi-GPU training

This tutorial will walk you through how to set up a working environment for multi-GPU training with Horovod and Keras.

Published 08/28/2019 by Chuan Li
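
A minimal sketch of the pattern such a setup enables, assuming the horovod.tensorflow.keras API and a toy model; Horovod launches one process per GPU, e.g. `horovodrun -np 4 python train.py`.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # Horovod runs one copy of this script per GPU

# Pin each process to its own GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across workers each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Broadcast initial weights from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

x = tf.random.normal([256, 784])
y = tf.random.uniform([256], maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```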

Tracking system resource (GPU, CPU, etc.) utilization during training with the Weights & Biases Dashboard

Resource utilization tracking can help machine learning engineers improve their software pipeline and model performance. This blog discusses how to use Weights & Biases to inspect the efficiency of TensorFlow training jobs.

Published 08/12/2019 by Chuan Li
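
A minimal sketch, assuming a toy Keras model and a hypothetical project name: once wandb.init starts a run, the W&B client samples system metrics in the background, and WandbCallback streams training metrics alongside them.

```python
import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

# Starting a run makes the W&B client sample system metrics (GPU, CPU,
# memory utilization) in the background; they show up in the run's
# dashboard. The project name here is a placeholder.
wandb.init(project="resource-tracking-demo")

x = tf.random.normal([512, 32])
y = tf.random.uniform([512], maxval=10, dtype=tf.int32)

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# WandbCallback streams training metrics alongside the system stats.
model.fit(x, y, epochs=2, callbacks=[WandbCallback()])
wandb.finish()
```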

TensorFlow 2.0 Tutorial 05: Distributed Training across Multiple Nodes

Distributed training scales up deep learning tasks so that larger models can be trained on more data. In this tutorial, we explain how to do distributed training across multiple nodes.

Published 06/07/2019 by Chuan Li
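
Not necessarily the approach this tutorial takes, but one standard way to train across nodes in TensorFlow 2 is tf.distribute.MultiWorkerMirroredStrategy (under tf.distribute.experimental in early 2.x releases). A minimal sketch, assuming a hypothetical two-node cluster with placeholder host names:

```python
import json
import os
import tensorflow as tf

# Each node describes the cluster and its own role via TF_CONFIG.
# Hypothetical two-node cluster; the host names are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node0.example.com:12345", "node1.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second node
})

# The strategy all-reduces gradients across every worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

x = tf.random.normal([512, 32])
y = tf.random.uniform([512], maxval=10, dtype=tf.int32)
model.fit(x, y, epochs=1)  # run the same script on every node
```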

TensorFlow 2.0 Tutorial 04: Early Stopping

This tutorial explains how early stopping is implemented in TensorFlow 2. The key lesson is to use the tf.keras.callbacks.EarlyStopping callback, which stops training when a monitored quantity has stopped improving for a set number of epochs.

Published 06/06/2019 by Chuan Li
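
A minimal sketch of the callback, with a toy model and random data; the monitored quantity and patience are illustrative choices:

```python
import tensorflow as tf

# Stop when the monitored quantity has not improved for `patience`
# consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,  # roll back to the best epoch's weights
)

x = tf.random.normal([512, 32])
y = tf.random.uniform([512], maxval=10, dtype=tf.int32)

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

model.fit(x, y, validation_split=0.2, epochs=50, callbacks=[early_stop])
```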

TensorFlow 2.0 Tutorial 03: Saving Checkpoints

This tutorial explains how to use checkpoints to save and restore TensorFlow models during training. The key is to use the tf.keras.callbacks.ModelCheckpoint callback to save the model, and to set initial_epoch in the model.fit call to resume training from a pre-saved checkpoint.

Published 06/06/2019 by Chuan Li
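
A minimal sketch of the save-and-resume pattern, with a toy model, random data, and a placeholder checkpoint path:

```python
import os
import tensorflow as tf

# Save weights after every epoch; the path pattern is a placeholder.
os.makedirs("ckpt", exist_ok=True)
ckpt_path = "ckpt/weights.{epoch:02d}.h5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(ckpt_path, save_weights_only=True)

x = tf.random.normal([512, 32])
y = tf.random.uniform([512], maxval=10, dtype=tf.int32)

def build_model():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    return model

model = build_model()
model.fit(x, y, epochs=3, callbacks=[checkpoint])

# Restore: load the saved weights, then set initial_epoch so training
# resumes from epoch 4 rather than starting over.
model = build_model()
model.load_weights("ckpt/weights.03.h5")
model.fit(x, y, initial_epoch=3, epochs=6, callbacks=[checkpoint])
```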

TensorFlow 2.0 Tutorial 02: Transfer Learning

This tutorial explains how to do transfer learning with TensorFlow 2. We cover handling customized datasets, restoring the backbone with the Keras Applications API, and restoring the backbone from disk.

Published 06/05/2019 by Chuan Li
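
A minimal sketch of restoring a backbone through the Keras Applications API; ResNet50 and the 10-class head are illustrative choices, not necessarily the tutorial's:

```python
import tensorflow as tf

# Restore an ImageNet-pretrained backbone via the Keras Applications API.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
backbone.trainable = False  # freeze the backbone for feature extraction

# Attach a new head for a hypothetical 10-class dataset.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# The backbone can equally be restored from disk, e.g.:
# backbone = tf.keras.models.load_model("path/to/backbone.h5")
```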

V100 server on-prem vs AWS p3 instance cost comparison

A cost and speed comparison between the Lambda Hyperplane 8 V100 GPU server and AWS p3 GPU instances. The comparison applies similarly to the NVIDIA DGX-1.

Published 02/11/2019 by Chuan Li

Text Generation: Char-RNN Data preparation and TensorFlow implementation

This tutorial is about making a character-based text generator using a simple two-layer LSTM. It walks you through the data preparation and the network architecture; a TensorFlow implementation is available at the end of the tutorial.

Published 02/08/2019 by Chuan Li
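
A Keras sketch of the architecture described, not the post's exact implementation; the vocabulary and layer sizes are placeholder values:

```python
import tensorflow as tf

VOCAB_SIZE = 65  # distinct characters in the corpus (placeholder value)
SEQ_LEN = 100    # length of each training sequence
EMBED_DIM = 128
HIDDEN = 256

# Two stacked LSTMs predicting the next character at every time step.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE),  # logits over the next character
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Inputs are index-encoded characters; targets are the same sequence
# shifted left by one step (random data here for self-containment).
x = tf.random.uniform([64, SEQ_LEN], maxval=VOCAB_SIZE, dtype=tf.int32)
y = tf.random.uniform([64, SEQ_LEN], maxval=VOCAB_SIZE, dtype=tf.int32)
model.fit(x, y, epochs=1)
```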

Multi-GPU enabled BERT using Horovod

BERT is Google's state-of-the-art method for pre-training language representations. This blog is about running BERT with multiple GPUs. Specifically, we use the Horovod framework to parallelize the tasks. We list all the changes to the original BERT implementation and highlight a few places that will make or break the performance.

Published 02/06/2019 by Chuan Li
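
The original BERT code is built on the TF Estimator API, so the sketch below instead illustrates, on a toy Keras model, the kinds of Horovod changes the post refers to: sharding the data per rank, scaling the learning rate, wrapping the optimizer, broadcasting initial variables, and checkpointing only on rank 0.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Shard the input so each worker sees a distinct slice of the data;
# feeding every worker the full dataset is a classic performance pitfall.
dataset = (tf.data.Dataset.range(1024)
           .shard(hvd.size(), hvd.rank())
           .map(lambda i: (tf.random.normal([32]), i % 10))
           .batch(16))

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])

# Scale the learning rate with worker count; wrap the optimizer so
# gradients are averaged across workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-4 * hvd.size()))
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Sync initial variables from rank 0, and checkpoint only on rank 0 so
# workers don't clobber each other's files.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("bert_demo.h5"))

model.fit(dataset, epochs=1, callbacks=callbacks)
```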
