Deploying LLMs with Vertex AI

Welcome to an exciting journey of deploying Large Language Models with DeepSpeed on Vertex AI!

Taha Binhuraib
6 min read · Apr 23, 2023
Image created using SDXL

Deploying large language models to production can be a challenging task for machine learning engineers. However, in this blog post, I will guide you through the process of deploying almost any language model to Vertex AI using DeepSpeed MII. With this solution, you can effectively manage the complexities of deploying large language models, ensuring a seamless transition from development to production.

DeepSpeed-MII is an open-source Python library from DeepSpeed that aims to make low-latency, low-cost inference of powerful models easily accessible. Its main selling point is throughput, measured in tokens generated per second per GPU: the DeepSpeed team reports that MII-Public offers over a 10-fold improvement compared to other solutions. MII accelerates over 30,000 models and supports a variety of tasks such as text generation, question answering, and text classification. These models are sourced from popular model repositories such as Hugging Face, FairSeq, and EleutherAI, and cover dense models based on BERT, RoBERTa, or GPT architectures, ranging from a few hundred million parameters to tens of billions. MII is constantly expanding its model support and is working towards massive dense and sparse models with over a hundred billion parameters. MII currently supports the Hugging Face Transformers model families.

One of the key advantages of using DeepSpeed MII is its ease of use. Getting started with this powerful library requires only a few lines of code, making the deployment process remarkably straightforward.


This code demonstrates how to use the mii Python package to deploy a text generation task. It starts by importing the mii package and defining configuration settings for the deployment. Then, the mii.deploy() method is used to deploy the pre-trained bigscience/bloom-560m model for the text generation task.

After deployment, the mii.mii_query_handle() method is used to create a query handler for the deployed model, and the generator.query() method is called to generate text based on the given queries.

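import mii
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
# Deploy the model for the text-generation task under a named deployment
mii.deploy(task="text-generation", model="bigscience/bloom-560m",
           deployment_name="bloom560m_deployment", mii_config=mii_configs)
generator = mii.mii_query_handle("bloom560m_deployment")
result = generator.query({"query": ["The theory of relativity is"]}, do_sample=True, max_new_tokens=30)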

Now that’s what I call ease of use!

Let’s explore how to integrate DeepSpeed MII with a Flask application. Flask keeps the web service small, so we can expose our language model for quick and efficient inference with very little code.

As outlined in the custom container requirements for prediction on Vertex AI, the Flask application needs to define two routes. The first route, /isalive, is the health check: it returns a response with status code 200 once the app is ready for inference. The second route, /predict, is used for making predictions. By defining these two routes, we can ensure that our Flask application is compatible with Vertex AI.

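from flask import Flask, Response, request

app = Flask(__name__)

@app.route('/isalive')  # health check route polled by Vertex AI
def isalive():
    print("/isalive request")
    return Response(status=200)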
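Before deploying, it can help to confirm that the health route answers with a 200 when the container runs locally. Here is a minimal sketch using only the standard library. The helper name `is_alive` is hypothetical (it is not part of Flask or the Vertex AI SDK), and it assumes the container is reachable at a base URL such as http://localhost:8080.

```python
import urllib.error
import urllib.request


def is_alive(base_url: str, route: str = "/isalive", timeout: float = 2.0) -> bool:
    """Return True if a GET on base_url + route answers with HTTP 200."""
    url = base_url.rstrip("/") + route
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and 4xx/5xx answers all count as "not ready".
        return False


if __name__ == "__main__":
    # Assumed local address of the running container.
    print(is_alive("http://localhost:8080"))
```

Calling it in a short retry loop gives the model time to finish loading before you send the first prediction request.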

Our predict route is defined as follows:

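@app.route('/predict', methods=['POST'])
def predict():
    # Vertex AI sends a JSON body of the form {"instances": [...]}
    form = request.get_json()
    # writer is assumed to be an instance of the LLMBaseModel class defined below
    return writer.generator(form['instances'])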

According to the requirements, the /predict route must handle POST requests with a JSON payload in the following format: {"instances": [dict, ...]}. When working with large language models, each instance typically carries parameters such as prompt, top_p, top_k, and max_new_tokens. Whatever parameters you use, the data always arrives under the "instances" field as a list of dictionaries. By sticking to this format, our Flask application can handle prediction requests for large language models in production.
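To make the payload format concrete, here is a small sketch of a check you could run on the request body before handing it to the model. The function name `extract_instances` is my own illustration, not something Vertex AI or Flask provides, and it only enforces the shape described above.

```python
from typing import Any, Dict, List


def extract_instances(body: Any) -> List[Dict[str, Any]]:
    """Return body["instances"] after checking it is a non-empty list of dicts."""
    if not isinstance(body, dict) or "instances" not in body:
        raise ValueError('request body must be a JSON object with an "instances" key')
    instances = body["instances"]
    if not isinstance(instances, list) or not instances:
        raise ValueError('"instances" must be a non-empty list')
    for index, item in enumerate(instances):
        if not isinstance(item, dict):
            raise ValueError(f"instances[{index}] must be a JSON object")
    return instances


if __name__ == "__main__":
    body = {"instances": [{"prompt": "hello", "top_p": 0.92, "top_k": 50, "max_new_tokens": 100}]}
    print(extract_instances(body)[0]["prompt"])
```

Raising a ValueError here lets the route turn malformed requests into a clear 400 response instead of a KeyError deep inside the model code.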

To launch our Flask application, we will configure it to run on port 8080, which is the default port Vertex AI sends health checks and prediction requests to.

import os  # at the top of the file
if __name__ == "__main__":
    port = int(os.environ.get('PORT', 8080))
    app.run(host='0.0.0.0', port=port, use_reloader=False)  # bind to all interfaces inside the container
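Vertex AI also passes the serving port to the container through the AIP_HTTP_PORT environment variable. A minimal sketch, assuming you want the app to honor that variable first, then PORT, and only then fall back to 8080; `resolve_port` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def resolve_port(env: Optional[Mapping[str, str]] = None, default: int = 8080) -> int:
    """Pick the serving port from AIP_HTTP_PORT, then PORT, then the default."""
    env = os.environ if env is None else env
    for name in ("AIP_HTTP_PORT", "PORT"):
        value = env.get(name)
        if value:
            return int(value)
    return default


if __name__ == "__main__":
    print(resolve_port())
```

Passing the environment in as an argument keeps the function easy to test without touching the real process environment.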

Our model class is a simple class that enables seamless loading and querying of the language model.

import logging  # assumed: standard-library logging for the status messages
from typing import Any, Dict, Mapping, Sequence

import torch

logger = logging.getLogger(__name__)


class LLMBaseModel:
    """This module will handle generating text"""

    def __init__(
        self,
        num_tokens: int = 200,
        top_p: float = 0.92,
        top_k: int = 50,
        model_name: str = "bigscience/bloom",
    ):
        self.model_name = model_name
        self.top_p = top_p
        self.top_k = top_k
        self.world_size = torch.cuda.device_count()
        self.num_tokens = num_tokens
        logger.info(f"Using {self.world_size} gpus")

        try:
            logger.info(f"Loading model: {self.model_name}")
            # DS is the model wrapper used here; its definition is not shown in this post
            self.model = DS()
            logger.info("model_loaded!")
        except Exception as err:
            raise ValueError(f"model is deprecated: {self.model_name}") from err

    def generator(self, form: Sequence[Mapping[str, Any]]) -> Dict[str, str]:
        logger.info(f"this is the form: {form}")
        # Only the first instance in the list is used
        prompt = form[0]["prompt"]
        top_p = form[0]["top_p"]
        top_k = form[0]["top_k"]
        max_new_tokens = form[0]["max_new_tokens"]
        response = self.model.generate(prompt, top_p, top_k, max_new_tokens)
        response = response["generation"][0]
        return {"response": str(response)}

It’s possible to enhance the functionality of our model class by implementing a helper method for reading the data. For the purposes of this demonstration, however, I have opted to keep it as straightforward as possible.
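If you do want such a helper, here is one way it could look. This is only a sketch: the `GenerationRequest` name and its `from_instance` method are hypothetical, and the defaults simply mirror the ones in the model class above (top_p 0.92, top_k 50, 200 tokens).

```python
from dataclasses import dataclass
from typing import Any, List, Mapping

# Defaults mirror the LLMBaseModel constructor above.
DEFAULT_TOP_P = 0.92
DEFAULT_TOP_K = 50
DEFAULT_MAX_NEW_TOKENS = 200


@dataclass(frozen=True)
class GenerationRequest:
    prompt: List[str]
    top_p: float = DEFAULT_TOP_P
    top_k: int = DEFAULT_TOP_K
    max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS

    @classmethod
    def from_instance(cls, instance: Mapping[str, Any]) -> "GenerationRequest":
        """Build a request from one entry of the "instances" list."""
        if "prompt" not in instance:
            raise ValueError('each instance needs a "prompt" field')
        raw_prompt = instance["prompt"]
        # Accept a single string or a list of strings.
        prompts = [raw_prompt] if isinstance(raw_prompt, str) else list(raw_prompt)
        return cls(
            prompt=prompts,
            top_p=float(instance.get("top_p", DEFAULT_TOP_P)),
            top_k=int(instance.get("top_k", DEFAULT_TOP_K)),
            max_new_tokens=int(instance.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS)),
        )


if __name__ == "__main__":
    print(GenerationRequest.from_instance({"prompt": "hello", "top_k": 10}))
```

With this in place, callers can omit the sampling parameters and the generator no longer fails with a KeyError when a field is missing.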

Containerize the Flask App

This process should be relatively straightforward. The only point to consider is that the parent image must contain both a CUDA runtime and the CUDA development tools. DeepSpeed needs the nvcc compiler, so make sure to use an image with a devel tag.

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get install -y nginx python3 python3-dev python3-pip && \
    rm -rf /var/lib/apt/lists/*
COPY . .
RUN pip3 install -r requirements.txt

ENTRYPOINT [ "python3" ]
# Placeholder name: replace app.py with the file that starts your Flask app
CMD [ "app.py" ]

Deploy to Vertex AI

Let’s get started! Our first step is to build and push the containerized Flask application. We can achieve this with the following two commands.

docker build -t gcr.io/<PROJECT-ID>/<IMAGE-TAG> .
docker push gcr.io/<PROJECT-ID>/<IMAGE-TAG>

If you navigate to the Container Registry page in your Google Cloud Console, you should see a directory containing your Docker image.

Uploading the model using the gcloud CLI is as simple as:

gcloud ai models upload \
  --container-image-uri=gcr.io/<PROJECT-ID>/<IMAGE-TAG> \
  --container-ports=8080 \
  --container-predict-route="/predict" \
  --container-health-route="/isalive" \
  --region=us-central1 \
  --display-name=test_llms

Create the endpoint

gcloud ai endpoints create \
  --project=tahatest-playground \
  --region=us-central1 \
  --display-name=<ENDPOINT-NAME>

The previous command may take approximately 15 minutes to complete. Once it’s done, you can use the model and endpoint IDs to deploy your model to an endpoint.

gcloud ai endpoints deploy-model <ENDPOINT-ID> \
  --project=tahatest-playground \
  --region=us-central1 \
  --model=<MODEL-ID> \
  --display-name=test_llms \
  --traffic-split=0=100 \
  --machine-type="a2-highgpu-1g" \
  --accelerator=type=nvidia-tesla-a100,count=1

When deploying your model on Vertex AI, it is important to choose a machine-type that supports GPUs. You can find a list of available configurations for the machines that support GPUs on this page: Configure compute. Be sure to choose the appropriate machine-type based on your requirements and budget.
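If you raise tensor_parallel in the MII config, the machine needs at least that many GPUs. Here is a small sketch for picking a machine type, assuming you stay within the A2 (NVIDIA A100) family. The helper name `smallest_a2_machine` is hypothetical, and the table reflects the GPU counts per A2 machine type as I understand them, so double-check availability for your region on the Configure compute page.

```python
# GPUs attached to each A2 machine type (assumption: verify against the docs).
A2_GPU_COUNTS = {
    "a2-highgpu-1g": 1,
    "a2-highgpu-2g": 2,
    "a2-highgpu-4g": 4,
    "a2-highgpu-8g": 8,
    "a2-megagpu-16g": 16,
}


def smallest_a2_machine(num_gpus: int) -> str:
    """Return the A2 machine type with the fewest GPUs that still fits num_gpus."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    candidates = [(count, name) for name, count in A2_GPU_COUNTS.items() if count >= num_gpus]
    if not candidates:
        raise ValueError(f"no A2 machine type offers {num_gpus} GPUs")
    return min(candidates)[1]


if __name__ == "__main__":
    print(smallest_a2_machine(1))
```

The returned name can be passed straight to the --machine-type flag, with the accelerator count set to match.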


Now, for the moment of truth, let’s begin with inference! There are several ways to send requests to your Vertex AI endpoint, but here’s an easy one. First, ensure that the google-cloud-aiplatform package is installed. Then, you can use the following code snippet to send requests to your endpoint.

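!pip install google-cloud-aiplatform
from google.cloud import aiplatform

project = 'PROJECT-ID'
location = 'us-central1'
aiplatform.init(project=project, location=location)

endpoint = aiplatform.Endpoint("projects/<YOUR-PROJECT-ID>/locations/us-central1/endpoints/<ENDPOINT-ID>")
# Each instance is a dict holding the prompt and the generation parameters
instances = [
    {"prompt": ["hello"], "top_k": 50, "top_p": 0.92, "max_new_tokens": 100}
]

prediction = endpoint.predict(instances=instances)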
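Another way to reach the endpoint is the REST API. The sketch below only builds the request with the standard library. The function name `build_predict_request` is hypothetical, and it assumes you already have an OAuth access token, for example from `gcloud auth print-access-token`.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_predict_request(
    project: str,
    location: str,
    endpoint_id: str,
    instances: List[Dict[str, Any]],
    access_token: str,
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the endpoint's :predict method."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/endpoints/{endpoint_id}:predict"
    )
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_predict_request(
        "PROJECT-ID",
        "us-central1",
        "ENDPOINT-ID",
        [{"prompt": ["hello"], "top_k": 50, "top_p": 0.92, "max_new_tokens": 100}],
        "ACCESS-TOKEN",
    )
    print(request.full_url)
```

Sending it is then a matter of calling urllib.request.urlopen(request) and decoding the JSON response.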


Congratulations! You just deployed a deep learning model to Google Cloud’s Vertex AI platform using Docker, Flask, and DeepSpeed MII for faster inference. This involved building a Docker image for your Flask application, pushing it to the Google Cloud Container Registry, and deploying it to a Vertex AI endpoint.




