Tip Sheet #51: Serving an ML model with LitServe


Hi Tip-Sheeters,

One of the key places APIs show up in machine learning is model inference: making a model available for real-time API calls that return predictions. Real-time serving is one of the two primary modes of model inference (the other being batch inference).

Tip Sheet #12 listed some of the top Python libraries worth exploring. One of those was LitServe, which is a framework for hosting models with a FastAPI backend. This week, I'll demonstrate this framework using my anchor project as an example.

LitServe: A framework for inference engines

The Top Python Libraries article I referenced in Tip Sheet #12 gave this summary of the value of LitServe:

Built on top of the ubiquitous FastAPI, LitServe transforms the process of serving AI models, adding powerful features like batching, streaming, and GPU autoscaling. With its intuitive design and optimizations, LitServe lets you deploy anything from classic machine learning models to large language models (LLMs) with ease—all while being at least 2x faster than a plain FastAPI setup.

Using FastAPI for serving your inference API is a great place to start. Let's see what additional goodies we get from using LitServe.

Creating a LitServe Project

In Chapter 13 of Hands-on APIs for AI and Data Science, you create an ML model to predict the cost of fantasy football waiver wire picks. The model is created with Scikit-Learn and stored in the ONNX format. The API is created with FastAPI and uses the ONNX Runtime to speed up inference. We'll use these elements as the base for our project and examine how to migrate them to a LitServe API. All of the completed code is available in the Tip Sheet GitHub Repo.

Step 1: Set up the existing FastAPI API

FastAPI has a new option for scaffolding a project with uv that I wanted to try out. So I launched a new GitHub Codespace and installed uv with the pip install uv command.

Then I ran the uvx fastapi-new command to create a new FastAPI project named fastapi-model-serving:

This command creates a basic pyproject.toml file. Because my model inference needs a few more libraries, I edited the file manually to add them (in retrospect, I'm not sure whether scikit-learn and skl2onnx are needed in the API, or if those were only for generating the ONNX models):

Now we'll install the required libraries in our runtime environment using pip install:

The fastapi-new command generates a basic "hello world" API. Now copy over the API code and the ML models (three of them, actually). The files are:

acquisition_model_10.onnx - 10th percentile model

acquisition_model_50.onnx - 50th percentile model

acquisition_model_90.onnx - 90th percentile model

main.py - FastAPI program

schemas.py - Pydantic schemas

You can run the API at this point with uv run fastapi dev main.py to verify that the FastAPI setup is working.

(At this point, you've verified that the API from the book works in this environment, created with the new fastapi-new setup command.)

Step 2: Run toy LitServe Example

To get LitServe running in your Codespace, follow the LitServe quick start instructions. First, we'll add the litserve library to the pyproject.toml dependencies and run pip install again.

Now create the file server.py using the toy example from the LitServe documentation:
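For reference, the quick-start example looks roughly like the sketch below. This is from my own reading of the LitServe docs, so check the quick start for the current version:

import litserve as ls


class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load the "model" once at startup; here it's just a squaring function
        self.model = lambda x: x ** 2

    def decode_request(self, request):
        # Pull the input value out of the JSON body
        return request["input"]

    def predict(self, x):
        # Run the model on the decoded input
        return self.model(x)

    def encode_response(self, output):
        # Wrap the result in a JSON-serializable dict
        return {"output": output}


if __name__ == "__main__":
    server = ls.LitServer(SimpleLitAPI())
    server.run(port=8000)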

Run the LitServe example with uv run python server.py:

You can test the example by calling it with curl or any other HTTP client.
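If you'd rather script that check, here's a minimal sketch using Python's requests library. It assumes the defaults from the quick-start sketch above: port 8000, the /predict path, and an {"input": ...} payload:

import requests

# POST the same JSON body the quick-start curl command would send
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
)
print(response.json())  # with the toy model above, expect {"output": 16.0}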

(At this point, you've confirmed you have the sample LitServe API running.)

Step 3: Migrate API and model to LitServe

To migrate the book's inference API to LitServe, first copy server.py to a new football_server.py file with this command: cp server.py football_server.py.

Make these changes to the football_server.py file:

Add these imports:

import onnxruntime as rt

import numpy as np

Update the LitServe setup() method with the ONNX setup code from main.py:

def setup(self, device):
    # This runs once at startup - same as your global session setup
    self.sess_10 = rt.InferenceSession("acquisition_model_10.onnx",
                                       providers=["CPUExecutionProvider"])
    self.sess_50 = rt.InferenceSession("acquisition_model_50.onnx",
                                       providers=["CPUExecutionProvider"])
    self.sess_90 = rt.InferenceSession("acquisition_model_90.onnx",
                                       providers=["CPUExecutionProvider"])

    # Cache input/output names
    self.input_name_10 = self.sess_10.get_inputs()[0].name
    self.label_name_10 = self.sess_10.get_outputs()[0].name

    self.input_name_50 = self.sess_50.get_inputs()[0].name
    self.label_name_50 = self.sess_50.get_outputs()[0].name

    self.input_name_90 = self.sess_90.get_inputs()[0].name
    self.label_name_90 = self.sess_90.get_outputs()[0].name

Move the code from main.py's predict() method into the predict() method in football_server.py:

def predict(self, request):
    # Extract features from the JSON body
    waiver_value_tier = request["waiver_value_tier"]
    weeks_remaining = request["fantasy_regular_season_weeks_remaining"]
    budget_pct_remaining = request["league_budget_pct_remaining"]

    input_data = np.array(
        [[waiver_value_tier, weeks_remaining, budget_pct_remaining]],
        dtype=np.int64,
    )

    # Perform ONNX inference
    pred_onx_10 = self.sess_10.run(
        [self.label_name_10], {self.input_name_10: input_data}
    )[0]
    pred_onx_50 = self.sess_50.run(
        [self.label_name_50], {self.input_name_50: input_data}
    )[0]
    pred_onx_90 = self.sess_90.run(
        [self.label_name_90], {self.input_name_90: input_data}
    )[0]

    return {
        "winning_bid_10th_percentile": round(float(pred_onx_10[0]), 2),
        "winning_bid_50th_percentile": round(float(pred_onx_50[0]), 2),
        "winning_bid_90th_percentile": round(float(pred_onx_90[0]), 2),
    }
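Putting it together, the overall shape of football_server.py ends up something like the sketch below. The class name FootballWaiverAPI and the server boilerplate are my assumptions carried over from the quick-start example; keep whatever names your copied server.py already uses:

import litserve as ls
import onnxruntime as rt
import numpy as np


class FootballWaiverAPI(ls.LitAPI):  # hypothetical class name
    def setup(self, device):
        # ONNX session setup shown above
        ...

    def predict(self, request):
        # Feature extraction and ONNX inference shown above
        ...


if __name__ == "__main__":
    server = ls.LitServer(FootballWaiverAPI())
    server.run(port=8000)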

With that, the API's ready to run as a LitServe API. Execute the API using uv run python football_server.py:

When prompted, open the API in the browser and add "/docs" to the URL to see the Swagger docs.

For some reason, the Swagger UI isn't set up to support the "Try it out" feature, so we'll test with curl, using data that is valid for our model.
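Here's a sketch of an equivalent request from Python's requests library. The feature values are made-up examples, and the port and /predict path assume LitServe defaults:

import requests

# Example payload; adjust the values to match a real waiver-wire scenario
payload = {
    "waiver_value_tier": 1,
    "fantasy_regular_season_weeks_remaining": 10,
    "league_budget_pct_remaining": 80,
}

response = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(response.json())
# Expect the winning_bid_10th/50th/90th_percentile keys from predict() above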

And we're done. We've migrated the model inference from our original FastAPI API to a LitServe one. Well done!

Benefits of LitServe

Since LitServe is built on top of FastAPI, what does LitServe provide that FastAPI doesn't? Their GitHub repo highlights features like batching, streaming, and GPU autoscaling layered on top of a standard FastAPI setup.
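As a taste of what that buys you, turning on something like dynamic batching is a server-configuration change rather than new endpoint code. Here's a rough sketch; the parameter names come from my reading of the LitServe docs, and the predict() method may need batch-aware changes, so double-check against the current release:

import litserve as ls
from football_server import FootballWaiverAPI  # hypothetical import of the API class above

server = ls.LitServer(
    FootballWaiverAPI(),
    accelerator="auto",   # choose CPU or GPU automatically
    max_batch_size=8,     # group up to 8 requests into one inference call
    batch_timeout=0.05,   # wait up to 50 ms to fill a batch
)
server.run(port=8000)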

They also have a cloud hosting product called Lightning Cloud that I hope to try out. I'll pass along what I learn in the Tip Sheet.

That's all for this week. Let me know if you give LitServe a try!

Keep coding,

Ryan Day

👉 https://tips.handsonapibook.com/ -- no spam, just a short email every week.

