Introducing the Lambda Inference API: Lowest-Cost Inference Anywhere
Today, we’re excited to announce the GA release of the Lambda Inference API, the lowest-cost inference anywhere. For just a fraction of a cent, you can access the latest LLMs through a serverless API.
Generate your own API key and see it for yourself:
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(
    curl -sL https://tinyurl.com/mvs27aet | \
      jq -Rs --arg prompt 'Summarize this Inference API blog, emphasizing new features and benefits in three concise bullet points. Here is the post:' \
        '{
          model: "llama3.3-70b-instruct-fp8",
          prompt: ($prompt + "\n\n" + (. | @json)),
          temperature: 0
        }'
  )" | jq .
Our new Lambda Inference API offers low-cost, scalable AI inference with some of the latest models, such as the recently released Llama 3.3 70B Instruct (FP8), at just $0.20 per million input and output tokens. That’s the lowest-priced serverless AI inference available anywhere, at less than half the cost of most competitors.
Choose from “Core” models, which are selected for stability and long-term support, or “Sandbox” models, which provide access to the latest innovations with more frequent updates. The API scales effortlessly to handle workloads of any size and integrates seamlessly with OpenAI-style endpoints, making implementation quick and easy.
Lambda Inference API Pricing
| Model | Context | Price per 1M input/output tokens |
|---|---|---|
| Core | | |
| Llama-3.1-8B-Instruct (BF16) | 131K | $0.03 |
| Llama-3.1-70B-Instruct (FP8) | 131K | $0.20 |
| Llama-3.1-405B-Instruct (FP8) | 131K | $0.90 |
| Sandbox | | |
| Llama-3.3-70B-Instruct (FP8) | 131K | $0.20 |
| Llama-3.2-3B-Instruct (BF16) | 131K | $0.02 |
| Hermes-3-Llama-3.1-8B (BF16) | 131K | $0.03 |
| Hermes-3-Llama-3.1-70B (FP8) | 131K | $0.20 |
| Hermes-3-Llama-3.1-405B (FP8) | 131K | $0.90 |
| LFM-40B (BF16) | 66K | $0.15 |
| Llama-3.1-Nemotron-70B-Instruct (FP8) | 131K | $0.20 |
| Qwen2.5-Coder-32B (BF16) | 33K | $0.09 |
* plus applicable sales tax
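Because pricing is a flat rate per million input and output tokens, per-request cost is just total tokens times the rate. A quick back-of-the-envelope sketch (prices copied from the table above; the model IDs other than llama3.3-70b-instruct-fp8, which appears in the examples in this post, are illustrative):

```python
# Estimate request cost: price is per 1M input + output tokens,
# so cost = total_tokens * rate / 1_000_000.
PRICE_PER_1M = {
    "llama3.1-8b-instruct": 0.03,        # illustrative ID
    "llama3.1-405b-instruct-fp8": 0.90,  # illustrative ID
    "llama3.3-70b-instruct-fp8": 0.20,   # used in the curl examples
}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Return the dollar cost of a request for the given model."""
    return total_tokens * PRICE_PER_1M[model] / 1_000_000

# A 132-token chat exchange on a $0.20/M model costs about $0.000026.
print(f"${estimate_cost('llama3.3-70b-instruct-fp8', 132):.6f}")
```

The same arithmetic explains the example costs later in this post: 132 tokens rounds to roughly three-thousandths of a cent, and even a ~120K-token summarization job stays around two cents.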
AI without the complexity
Inference is where trained models prove their worth. It’s where an AI model takes in new data, or prompts (text, images, and embeddings), and generates actionable predictions, insights, or even videos of fire-fighting kittens in near real time.
From powering conversational agents to generating images, inference is at the heart of every AI-driven application.
But let’s face it: deploying AI at scale is no easy feat. It requires massive amounts of compute, significant MLOps expertise to set everything up and performance-tune it, and a hefty budget to keep it all running smoothly. If you’ve ever tried deploying an AI model, you know.
That’s why we built the Lambda Inference API: to make inference simple, scalable, and accessible. For over a decade, Lambda has been engineering every layer of our stack, from hardware and networking to software, for AI performance and efficiency.
We’ve taken everything we’ve learned along the way and built an Inference API, underpinned by an industry-leading inference stack, that’s purpose-built for AI.
- Cut costs, not corners. Lambda Inference API provides Meta’s Llama 3.1 405B at just $0.90 per million tokens.
- Pay-per-token. You’re only charged for the tokens you use, ensuring zero waste and complete transparency. No hidden fees or long-term commitments.
- Scalability? Handled. Designed to dynamically meet the demands of workloads of any size, so you can scale without worrying about infrastructure bottlenecks.
- No rate limits. Run inference unconstrained by any limits on your API calls.
Whether you're supporting a handful of users or millions, our API dynamically scales to meet demand.
Getting Started with Lambda Inference API
It’s easy to get started with the Lambda Inference API, and for the VS Code lovers out there, we have a quickstart guide on integrating the Lambda Inference API into VS Code. Here’s how:
- Generate an API key
- Choose your model
- Pick your endpoint:
- /completions - takes a single text string (a prompt) as input and returns a generated completion
- /chat/completions - takes a list of messages that make up a conversation and returns the model’s next reply
Then, using your language of choice, start leveraging the latest and greatest models.
# lambda_inference_api_test.py
from openai import OpenAI

# The Inference API is OpenAI-compatible, so the standard OpenAI client
# works; just point it at Lambda's base URL.
openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    model=model,
)
print(response)
Check out our Documentation for more information on how to integrate the API and our /completions and /chat/completions endpoints in your application.
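For multi-turn use cases, /chat/completions takes a list of role-tagged messages rather than a single prompt string. A minimal sketch of what the request body looks like (the system-message wording is illustrative; the model ID is the one used in the curl examples in this post):

```python
import json

# Build a /chat/completions request body: a conversation is a list of
# {"role", "content"} messages; "system" sets behavior, "user" asks.
def build_chat_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a concise support agent."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0,
    }

body = build_chat_request(
    "llama3.3-70b-instruct-fp8",
    "I can't log into the app, I'm getting a 500 error",
)
print(json.dumps(body, indent=2))
# POST this JSON to https://api.lambdalabs.com/v1/chat/completions with the
# same Authorization header as the /completions example above.
```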
It’s a great time to be a builder
With so many new models being released, it can be tough to keep up and figure out which ones are worth integrating, especially as their VRAM requirements continue to grow. Until now, developers had to manage the infrastructure to run these models themselves. But with the Lambda Inference API, you can simply make an API call to the latest models for fractions of a cent.
Conversational Agents:
Power chatbots and virtual assistants at a fraction of the cost of other providers.
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hermes3-70b",
    "prompt": "Create a concise relevant reply to the following message from a customer: I cant log into the app, Im getting a 500 error",
    "temperature": 0
  }' | jq .
Output:
Sorry to hear that you're experiencing issues logging in to the app.
A 500 error usually indicates
a server-side issue on our end. Can you please try closing and reopening the app,
or clearing your cache and trying again? If the issue persists, our support team will be happy to
assist you further.
Total Tokens: 132
Total Cost: $0.00003 (rounded)
Content Generation and Summarization:
Use the Lambda Inference API to summarize large amounts of text; most models support context windows of up to 131K tokens. For example, this command summarizes 80% of The Odyssey by Homer for just a few cents.
curl https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer $LAMBDA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(curl -sL https://tinyurl.com/hthdw2nn | jq -Rs --arg prompt "Summarize this text:" \
    '{
      model: "llama3.3-70b-instruct-fp8",
      prompt: ($prompt + "\n\n" + .),
      temperature: 0
    }')" | jq .
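Before sending a very long document like this, it’s worth a pre-flight check that the prompt fits the model’s context window. Exact counts depend on the model’s tokenizer, but a rough rule of thumb of about four characters per token is enough for a sanity check. A sketch (the 4-chars-per-token ratio and the 2,048-token output reserve are assumptions, not properties of the API):

```python
CONTEXT_WINDOW = 131_072  # 131K-token window for the Llama models above

def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4

def fits_context(text: str, reserved_for_output: int = 2_048) -> bool:
    """Check whether a prompt plausibly fits, leaving room for the reply."""
    return rough_token_count(text) + reserved_for_output <= CONTEXT_WINDOW

print(fits_context("word " * 100))      # a short prompt fits easily
print(fits_context("word " * 200_000))  # ~1M characters: too long, chunk first
```

If the check fails, split the document into chunks, summarize each, and then summarize the summaries.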
Output:
Penelope's prayer to Diana shows her desperation and longing for her husband, as well as her
cleverness in drawing parallels to mythology. The story of the daughters of Pandareus highlights
her fear of being forced to marry one of the suitors against her will.
The fact that Penelope is still haunted by her misery even in her dreams emphasizes the depth
of her love for Ulysses and the toll his long absence has taken on her. Her dream of Ulysses by
her side also foreshadows his imminent return, building anticipation for the reunion to come.
This passage underscores Penelope's fidelity, intelligence, and emotional resilience in the face of
overwhelming challenges, reinforcing her status as one of the most significant and admirable
characters in the epic.
Total Tokens: 120,724
Total Cost: $0.02 (rounded)
A Glimpse Into the Future
We’re just getting started. Here’s what’s next for the Lambda Inference API:
- More models: Expect support for additional state-of-the-art models, enabling even more use cases across industries.
- More formats: Support for multimodal models, reasoning models, image generation, video generation, and more.
- Batch inference: Lower-cost, batched inference for non-realtime, offline, or overnight processing tasks.
Start Building Today
With the Lambda Inference API, you can leverage cutting-edge AI models without the high costs, infrastructure headaches, or operational complexity.
Sign up today and see how the Lambda Inference API can transform your next project, one API call at a time.