Deploying and managing Llama 4 models involves multiple steps: navigating complex infrastructure setup, managing GPU availability, ensuring scalability, and handling ongoing operational overhead. What if you could address these challenges and focus directly on building your applications? It’s possible with Vertex AI.
We’re thrilled to announce that Llama 4, the latest generation of Meta’s open large language models, is now generally available (GA) as a fully managed API endpoint in Vertex AI! In addition to Llama 4, we’re also announcing the general availability of the Llama 3.3 70B managed API in Vertex AI.
Llama 4 reaches new performance peaks compared to previous Llama models, adding multimodal capabilities and a highly efficient Mixture-of-Experts (MoE) architecture. Llama 4 Scout outperforms all previous Llama generations while delivering significant efficiency for multimodal tasks, and it is optimized to run in a single-GPU environment. Llama 4 Maverick is Meta's most capable model available today, designed for reasoning, complex image understanding, and demanding generative tasks.
With Llama 4 as a fully managed API endpoint, you can now leverage Llama 4’s advanced reasoning, coding, and instruction-following capabilities with the ease, scalability, and reliability of Vertex AI to build more sophisticated and impactful AI-powered applications.
This post will guide you through getting started with Llama 4 as a Model-as-a-Service (MaaS), highlight the key benefits, show you how simple it is to use, and touch upon cost considerations.
Discover Llama 4 MaaS in Vertex AI Model Garden
Vertex AI Model Garden is your central hub for discovering and deploying foundation models on Google Cloud via managed APIs. It offers a curated selection of Google’s own models (like Gemini), open-source models, and third-party models — all accessible through simplified interfaces. The addition of Llama 4 (GA) as a managed service expands this selection, offering you more flexibility.
Accessing Llama 4 as a Model-as-a-Service (MaaS) on Vertex AI has the following advantages:
1. Zero infrastructure management: Google Cloud handles the underlying infrastructure, GPU provisioning, software dependencies, patching, and maintenance. You interact with a simple API endpoint.
2. Guaranteed performance: Processing capacity is assigned to these models, ensuring high availability and consistent throughput.
3. Enterprise-grade security and compliance: Benefit from Google Cloud's robust security, data encryption, access controls, and compliance certifications.
Getting started with Llama 4 MaaS
Getting started with Llama 4 MaaS on Vertex AI only requires you to navigate to the Llama 4 model card within the Vertex AI Model Garden and accept the Llama Community License Agreement; you cannot call the API without completing this step.
Once you have accepted the Llama Community License Agreement, find the specific Llama 4 MaaS model you wish to use in the Model Garden (e.g., "Llama 4 17B Instruct MaaS") and take note of its unique Model ID (like meta/llama-4-scout-17b-16e-instruct-maas), as you'll need this ID when calling the API.
Then you can call the Llama 4 MaaS endpoint directly using the Chat Completions API. There's no separate "deploy" step for the MaaS offering; Google Cloud manages endpoint provisioning for you. Below is an example of calling Llama 4 Scout through the OpenAI-compatible Chat Completions API in Python.
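This sketch assumes the openai Python SDK (v1+) and Application Default Credentials; the project ID, region, and endpoint path are illustrative, so confirm them against the Model Garden documentation for your environment:

```python
import openai
from google.auth import default
from google.auth.transport.requests import Request

# Assumptions: replace PROJECT_ID with your project; the serving region
# shown here is illustrative -- check the model card for availability.
PROJECT_ID = "your-project-id"
LOCATION = "us-east5"

# Fetch an access token via Application Default Credentials.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# The MaaS endpoint is OpenAI-compatible, so the standard openai client works.
client = openai.OpenAI(
    base_url=(
        f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi"
    ),
    api_key=credentials.token,
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct-maas",
    messages=[
        {"role": "user", "content": "Explain Mixture-of-Experts in two sentences."}
    ],
    temperature=0.2,
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Note that the access token obtained this way is short-lived; long-running services should refresh credentials rather than caching the token.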
Important: Always consult the specific Llama 4 model card in Vertex AI Model Garden. It contains crucial information about:
The exact input/output schema expected by the model.
Supported parameters (like temperature, top_p, max_tokens) and their valid ranges.
Any specific formatting requirements for prompts or multimodal inputs (a multimodal request sketch follows this list).
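As an illustration of the last two points, multimodal prompts follow the OpenAI-style content-parts schema. The sketch below reuses the `client` from the earlier example; the image URI is illustrative, and the exact schema, supported image sources, and valid parameter ranges are defined on the model card:

```python
# A sketch of a multimodal request; verify the schema on the model card.
# Reuses the `client` constructed in the earlier example.
response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct-maas",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                # Illustrative public sample image; substitute your own URI.
                "image_url": {"url": "gs://cloud-samples-data/generative-ai/image/scones.jpg"},
            },
        ],
    }],
    # Parameters such as temperature, top_p, and max_tokens must stay
    # within the ranges listed on the model card.
    temperature=0.4,
    top_p=0.95,
    max_tokens=128,
)
print(response.choices[0].message.content)
```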
Cost and quota considerations
Using Llama 4 as a Model-as-a-Service on Vertex AI operates on a predictable model that combines pay-as-you-go pricing with usage quotas. Understanding both the pricing structure and your service quotas is essential for managing costs and scaling your application effectively.
As for pricing, you pay only for the prediction requests you make; the underlying infrastructure, scaling, and management costs are built into the API usage price. Refer to the Vertex AI pricing page for details.
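Since billing is token-based, a simple back-of-envelope calculation can help you budget. The per-token rates below are placeholders, not actual prices; always take the real figures from the pricing page:

```python
# Placeholder rates for illustration only -- substitute the actual
# per-1M-token prices from the Vertex AI pricing page.
INPUT_PRICE_PER_1M = 0.25   # USD, hypothetical
OUTPUT_PRICE_PER_1M = 0.70  # USD, hypothetical

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request in USD."""
    return (input_tokens * INPUT_PRICE_PER_1M
            + output_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

# e.g., a 2,000-token prompt with a 500-token completion
print(f"${estimate_cost(2_000, 500):.6f}")
```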
To ensure service stability and fair usage, your use of Llama 4 as a Model-as-a-Service on Vertex AI is subject to quotas, such as limits on the number of requests per minute (RPM) your project can make to a specific model endpoint. Refer to our quota documentation for more details.
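If traffic bursts above your RPM quota, the endpoint returns rate-limit errors, and a common client-side pattern is exponential backoff. A minimal sketch, assuming the openai Python SDK v1+ (which surfaces these as RateLimitError):

```python
import time

import openai


def create_with_backoff(client: openai.OpenAI, max_retries: int = 5, **kwargs):
    """Call chat.completions.create, backing off on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            # Sleep 1s, 2s, 4s, ... before retrying once the RPM quota is hit.
            time.sleep(2 ** attempt)
    raise RuntimeError("Retry budget exhausted; consider requesting a quota increase.")
```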
What’s next
With Llama 4 now generally available as a Model-as-a-Service on Vertex AI, you can leverage one of the most advanced open LLMs without managing the underlying infrastructure.
Explore Llama 4 in Model Garden
Check out the documentation
Review pricing & quotas
We are excited to see what applications you will build with Llama 4 on Vertex AI. Share your feedback and experiences through our Google Cloud community forum.