Introducing the Lambda Inference API: Lowest-Cost Inference Anywhere

December 12, 2024 • 20 min read

Today, we’re excited to announce the GA release of the Lambda Inference API, the lowest-cost inference anywhere. For just a fraction of a cent, you can access the latest LLMs through a serverless API.

Generate your own API key and see it for yourself:

curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(
    curl -sL https://tinyurl.com/mvs27aet | \
    jq -Rs --arg prompt 'Summarize this Inference API blog, emphasizing new features and benefits in three concise bullet points. Here is the post:' \
      '{
        model: "llama3.3-70b-instruct-fp8",
        prompt: ($prompt + "\n\n" + (. | @json)),
        temperature: 0
      }'
  )" | jq .

Our new Lambda Inference API offers low-cost, scalable AI inference with some of the latest models, such as the recently released Llama 3.3 70B Instruct (FP8), at just $0.20 per million input and output tokens. That’s the lowest-priced serverless AI inference available anywhere at less than half the cost of most competitors.

Choose from “Core” models, which are selected for stability and long-term support, or “Sandbox” models provide access to the latest innovations with more frequent updates. The API scales effortlessly to handle workloads of any size and integrates seamlessly with OpenAI-style endpoints, making implementation quick and easy.

Lambda Inference API Pricing

Model	Context	Price per 1M input/output tokens
Core
Llama-3.1-8B-Instruct (BF16)	131K	$0.03
Llama-3.1-70B-Instruct (FP8)	131K	$0.20
Llama-3.1-405B-Instruct (FP8)	131K	$0.90
Sandbox
Llama-3.3-70B-Instruct (FP8)	131k	$0.20
Llama-3.2-3B-Instruct (BF16)	131K	$0.02
Hermes-3-Llama-3.1-8B (BF16)	131K	$0.03
Hermes-3-Llama-3.1-70B (FP8)	131K	$0.20
Hermes-3-Llama-3.1-405B (FP8)	131K	$0.90
LFM-40B (BF16)	66K	$0.15
Llama3.1-nemotron-70B-instruct (FP8)	131K	$0.20
Qwen2.5-Coder-32B (BF16)	33K	$0.09

* plus applicable sales tax

AI without the complexity

Inference is where trained models prove their worth. It’s where the AI model takes in new data (aka prompts)—text, images, and embeddings—and generates actionable predictions, insights, or even videos of fire-fighting kittens in near real-time.

From powering conversational agents to generating images, inference is at the heart of every AI-driven application.

But let’s face it: deploying AI at scale is no easy feat. It requires massive amounts of compute, significant expertise in MLOps to set everything up and performance tune it, as well as a hefty budget to keep it all running smoothly. If you’ve ever tried deploying an AI model before, you know.

That’s why we built the Lambda Inference API, to make it simple, scalable, and accessible. For over a decade, Lambda has been engineering every layer of our stack– hardware, networking, and software, for AI performance and efficiency.

We’ve taken everything we’ve learned since then and built an Inference API, underpinned by an industry-leading inference stack, that’s purpose-built for AI.

Cut costs. Not corners.
- Lambda Inference API provides Meta’s Llama 3.1 405B at just $0.90 per million tokens.
Pay-per-token
- You’re only charged for the tokens you use, ensuring zero waste and complete transparency. No hidden fees or long-term commitments.
Scalability? Handled.
- Designed to dynamically meet the demands of workloads of any size, so you can scale without worrying about infrastructure bottlenecks.
No rate limits
- Run inference unconstrained by any limits on your API calls.

Whether you're supporting a handful of users or millions, our API dynamically scales to meet demand.

Getting Started with Lambda Inference API

It’s easy to get started with the Lambda Inference API, and for those VScode lovers out there, we have a quickstart guide on integrating the Lambda Inference API into VS Code. Here’s how:

Generate an API key
Choose your model
Pick your endpoint:
1. /completions - single text string (a prompt) as input, then outputs a response
2. /chat/completions - takes a list of messages that make up a conversation

Then, using your language of choice start leveraging the latest and greatest models.

# Lambda_Infernce_API_test.py
from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
   api_key=openai_api_key,
   base_url=openai_api_base,
)

model = "<MODEL>"

response = client.completions.create(
 prompt="Computers are",
 temperature=0,
 model=model,
)

print(response)

Check out our Documentation for more information on how integrate the API and our /completions and /chat/completions endpoints in your application.

It’s a great time to be a builder

With so many new models coming, it can be tough to keep up and find out which ones are worth integrating, especially as their VRAM requirements continue to grow. Until now, developers had to worry about managing the infrastructure to run these models. But with the Lambda Inference API, you can simply make an API call to the latest models for fractions of a cent.

Conversational Agents:

Power chatbots and virtual assistants for a fraction of the cost of anywhere else.

curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer "$LAMBDA_API_KEY"" \
  -H "Content-Type: application/json" \
  -d '{
     "model": "hermes3-70b",
     "prompt": "Create a concise relevant reply to the following message from a customer: I cant log into the app, Im getting a 500 error",
     "temperature": 0
     }' | jq .

Output:

Sorry to hear that you're experiencing issues logging in to the app. A 500 error usually indicates 
a server-side issue on our end. Can you please try closing and reopening the app, 
or clearing your cache and trying again? If the issue persists, our support team will be happy to 
assist you further.

Total Tokens: 132

Total Cost: $0.00003 (rounded)

Content Generation and Summarization:

Use the Lambda Inference API to summarize large amounts of text, with most models supporting context windows up to 131k. For example, this command summarizes 80% of The Odyssey by Homer for just a few cents.

curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer "$LAMBDA_API_KEY"" \
  -H "Content-Type: application/json" \
  -d "$(curl -sL https://tinyurl.com/hthdw2nn | jq -Rs --arg prompt "Summarize this text:" \
     '{
      model: "llama3.3-70b-instruct-fp8",
      prompt: ($prompt + "\n\n" + .),
      temperature: 0
     }')" | jq .

Output

Penelope's prayer to Diana shows her desperation and longing for her husband, as well as her 
cleverness in drawing parallels to mythology. The story of the daughters of Pandareus highlights 
her fear of being forced to marry one of the suitors against her will.

The fact that Penelope is still haunted by her misery even in her dreams emphasizes the depth 
of her love for Ulysses and the toll his long absence has taken on her. Her dream of Ulysses by 
her side also foreshadows his imminent return, building anticipation for the reunion to come.

This passage underscores Penelope's fidelity, intelligence, and emotional resilience in the face of 
overwhelming challenges, reinforcing her status as one of the most significant and admirable 
characters in the epic.

Total tokens: 120724

Cost: $ 0.02 (rounded)

A Glimpse Into the Future

We’re just getting started. Here’s what’s next for the Lambda Inference API:

More models:
- Expect support for additional state-of-the-art models, enabling even more use cases across industries.
More formats:
- Support for multimodal models, reasoning models, image generation, video generation, and more
Batch inference:
- Lower-cost, batched inference for non-realtime, offline, or overnight processing tasks.

Start Building Today

With the Lambda Inference API, you can leverage cutting-edge AI models without the high costs, infrastructure headaches, or operational complexity.