Tip Sheet #51: Serving an ML model with LitServe


Hi Tip-Sheeters,

One of the key places APIs show up in machine learning is model inference: making a model available for real-time API calls that return predictions. Real-time serving is one of the two primary modes of model inference (the other being batch inference).

Tip Sheet #12 listed some of the top Python libraries worth exploring. One of those was LitServe, which is a framework for hosting models with a FastAPI backend. This week, I'll demonstrate this framework using my anchor project as an example.

LitServe: A framework for inference engines

The Top Python Libraries article I referenced in Tip Sheet #12 gave this summary of the value of LitServe:

Built on top of the ubiquitous FastAPI, LitServe transforms the process of serving AI models, adding powerful features like batching, streaming, and GPU autoscaling. With its intuitive design and optimizations, LitServe lets you deploy anything from classic machine learning models to large language models (LLMs) with ease—all while being at least 2x faster than a plain FastAPI setup.

Using FastAPI for serving your inference API is a great place to start. Let's see what additional goodies we get from using LitServe.

Creating a LitServe Project

In Chapter 13 of Hands-on APIs for AI and Data Science, you create an ML model to predict the cost of fantasy football waiver wire picks. The model is created with Scikit-Learn and stored in the ONNX format. The API is created with FastAPI and uses the ONNX Runtime to speed up inference. We'll use these elements as the base for our project and examine how to migrate them to a LitServe API. All of the completed code is available in the Tip Sheet GitHub Repo.

Step 1: Set up the existing FastAPI API

FastAPI has a new option for scaffolding a project with uv that I wanted to try out. So I launched a new GitHub Codespace and installed uv with the pip install uv command.

Then I ran the uvx fastapi-new command to create a new FastAPI project named fastapi-model-serving:

This command creates a basic pyproject.toml file. Because my model inference needs a few more libraries, I edited the file manually to add them (in retrospect, I'm not sure whether scikit-learn and skl2onnx are needed in the API, or if those were only for generating the ONNX models):

Now we'll install the required libraries in our runtime environment using pip install:

The fastapi-new command generates a basic "hello world" API. Now copy over the API code and the ML models (three of them, actually). The files are:

acquisition_model_10.onnx - 10th percentile model

acquisition_model_50.onnx - 50th percentile model

acquisition_model_90.onnx - 90th percentile model

main.py - FastAPI program

schemas.py - Pydantic schemas

You can run the API at this point with uv run fastapi dev main.py to verify that the FastAPI setup is working.

(At this point, you've verified that the API from the book works in this environment, created with the new fastapi-new setup command.)

Step 2: Run toy LitServe Example

To get LitServe running in your Codespace, follow the LitServe quick start instructions. First, we'll add the litserve library to the pyproject.toml dependencies and run pip install again.

Now create the file server.py using the toy example from the LitServe documentation:
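For reference, the quick-start example looks roughly like the sketch below. This is from my own reading of the LitServe docs, so check the quick start for the current version:

import litserve as ls


class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load the "model" once at startup; here it's just a squaring function
        self.model = lambda x: x ** 2

    def decode_request(self, request):
        # Pull the input value out of the JSON body
        return request["input"]

    def predict(self, x):
        # Run the model on the decoded input
        return self.model(x)

    def encode_response(self, output):
        # Wrap the result in a JSON-serializable dict
        return {"output": output}


if __name__ == "__main__":
    server = ls.LitServer(SimpleLitAPI())
    server.run(port=8000)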

Run the LitServe example with uv run python server.py:

You can test the example by calling it with curl or any other HTTP client.
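If you'd rather script that check, here's a minimal sketch using Python's requests library. It assumes the defaults from the quick-start sketch above: port 8000, the /predict path, and an {"input": ...} payload:

import requests

# POST the same JSON body the quick-start curl command would send
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
)
print(response.json())  # with the toy model above, expect {"output": 16.0}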

(At this point, you've confirmed you have the sample LitServe API running.)

Step 3: Migrate API and model to LitServe

To migrate the book's inference API to LitServe, first copy server.py to a new football_server.py file with this command: cp server.py football_server.py.

Make these changes to the football_server.py file:

Add these imports:

import onnxruntime as rt

import numpy as np

Update the LitServe setup() method with the ONNX setup code from main.py:

def setup(self, device):
    # This runs once at startup - same as your global session setup
    self.sess_10 = rt.InferenceSession("acquisition_model_10.onnx",
                                       providers=["CPUExecutionProvider"])
    self.sess_50 = rt.InferenceSession("acquisition_model_50.onnx",
                                       providers=["CPUExecutionProvider"])
    self.sess_90 = rt.InferenceSession("acquisition_model_90.onnx",
                                       providers=["CPUExecutionProvider"])

    # Cache input/output names
    self.input_name_10 = self.sess_10.get_inputs()[0].name
    self.label_name_10 = self.sess_10.get_outputs()[0].name

    self.input_name_50 = self.sess_50.get_inputs()[0].name
    self.label_name_50 = self.sess_50.get_outputs()[0].name

    self.input_name_90 = self.sess_90.get_inputs()[0].name
    self.label_name_90 = self.sess_90.get_outputs()[0].name

Move the code from main.py's predict() method into the predict() method in football_server.py:

def predict(self, request):
    # Extract features from the JSON body
    waiver_value_tier = request["waiver_value_tier"]
    weeks_remaining = request["fantasy_regular_season_weeks_remaining"]
    budget_pct_remaining = request["league_budget_pct_remaining"]

    input_data = np.array(
        [[waiver_value_tier, weeks_remaining, budget_pct_remaining]],
        dtype=np.int64,
    )

    # Perform ONNX inference
    pred_onx_10 = self.sess_10.run(
        [self.label_name_10], {self.input_name_10: input_data}
    )[0]
    pred_onx_50 = self.sess_50.run(
        [self.label_name_50], {self.input_name_50: input_data}
    )[0]
    pred_onx_90 = self.sess_90.run(
        [self.label_name_90], {self.input_name_90: input_data}
    )[0]

    return {
        "winning_bid_10th_percentile": round(float(pred_onx_10[0]), 2),
        "winning_bid_50th_percentile": round(float(pred_onx_50[0]), 2),
        "winning_bid_90th_percentile": round(float(pred_onx_90[0]), 2),
    }
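Putting it together, the overall shape of football_server.py ends up something like the sketch below. The class name FootballWaiverAPI and the server boilerplate are my assumptions carried over from the quick-start example; keep whatever names your copied server.py already uses:

import litserve as ls
import onnxruntime as rt
import numpy as np


class FootballWaiverAPI(ls.LitAPI):  # hypothetical class name
    def setup(self, device):
        # ONNX session setup shown above
        ...

    def predict(self, request):
        # Feature extraction and ONNX inference shown above
        ...


if __name__ == "__main__":
    server = ls.LitServer(FootballWaiverAPI())
    server.run(port=8000)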

With that, the API's ready to run as a LitServe API. Execute the API using uv run python football_server.py:

When prompted, open the API in the browser and add "/docs" to the URL to see the Swagger docs.

For some reason, the Swagger UI isn't set up to support the "Try it out" feature, so we'll test with curl, using data that is valid for our model.
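Here's a sketch of an equivalent request from Python's requests library. The feature values are made-up examples, and the port and /predict path assume LitServe defaults:

import requests

# Example payload; adjust the values to match a real waiver-wire scenario
payload = {
    "waiver_value_tier": 1,
    "fantasy_regular_season_weeks_remaining": 10,
    "league_budget_pct_remaining": 80,
}

response = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(response.json())
# Expect the winning_bid_10th/50th/90th_percentile keys from predict() above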

And we're done. We've migrated the model inference from our original FastAPI API to a LitServe one. Well done!

Benefits of LitServe

Since LitServe is built on top of FastAPI, what does LitServe provide that FastAPI doesn't? Their GitHub repo highlights features like batching, streaming, and GPU autoscaling layered on top of a standard FastAPI setup.
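As a taste of what that buys you, turning on something like dynamic batching is a server-configuration change rather than new endpoint code. Here's a rough sketch; the parameter names come from my reading of the LitServe docs, and the predict() method may need batch-aware changes, so double-check against the current release:

import litserve as ls
from football_server import FootballWaiverAPI  # hypothetical import of the API class above

server = ls.LitServer(
    FootballWaiverAPI(),
    accelerator="auto",   # choose CPU or GPU automatically
    max_batch_size=8,     # group up to 8 requests into one inference call
    batch_timeout=0.05,   # wait up to 50 ms to fill a batch
)
server.run(port=8000)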

They also have a cloud hosting product called Lightning Cloud that I hope to try out. I'll pass along what I learn in the Tip Sheet.

That's all for this week. Let me know if you give LitServe a try!

Keep coding,

Ryan Day

👉 https://tips.handsonapibook.com/ -- no spam, just a short email every week.

