I am trying to deploy a machine learning model (built with TensorFlow/Keras) as a service using FastAPI and Gunicorn, but I cannot get enough throughput from the API even after increasing the number of Gunicorn workers and threads.
I have tried the following configurations:

1 worker:

```
gunicorn model:app -k uvicorn.workers.UvicornWorker -b hostname:port
```

This gives me a throughput of 15 responses/sec.

5 workers:

```
gunicorn model:app -k uvicorn.workers.UvicornWorker --workers=5 -b hostname:port
```

This gives me a throughput of 30 responses/sec.
30 responses/sec is the maximum throughput I can reach, but I need to scale to around 300 responses/sec. I also tried increasing the number of threads, but that did not increase throughput either.
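For reference, the threaded runs looked roughly like this (the thread count shown is illustrative; I tried several values):

```shell
# --threads value here is just an example; I experimented with a range of counts
gunicorn model:app -k uvicorn.workers.UvicornWorker --workers=5 --threads=2 -b hostname:port
```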
When I time a single request-response cycle with one worker, the response takes around 80 ms to return (measured through Postman).
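As a back-of-the-envelope sanity check on these numbers (my own arithmetic, assuming each request fully occupies one synchronous worker for the entire 80 ms latency window):

```python
# Rough throughput estimate.
# Assumption: each request ties up one synchronous worker
# for the full ~80 ms observed latency.
latency_s = 0.080                      # observed per-request latency
per_worker_rps = 1 / latency_s         # ~12.5 requests/sec per worker

target_rps = 300
workers_needed = target_rps / per_worker_rps

print(f"{per_worker_rps:.1f} req/s per worker, "
      f"~{workers_needed:.0f} fully parallel workers for {target_rps} req/s")
```

So the single-worker figure (~15/sec) is roughly in line with an 80 ms latency, but 5 workers should be closer to 60/sec than the 30/sec I observe, and ~300/sec would require on the order of 24 fully parallel workers.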
I am running this on a Linux machine with the following specs:
- OS - CentOS
- CPU(s) - 8
- Core(s) per socket - 4
- Thread(s) per core - 2
- Memory - ~65 GB
The system is almost idle while the service is running (less than 5% CPU usage).