I am trying to deploy a machine learning model (built with TensorFlow/Keras) as a service using FastAPI and Gunicorn, but I cannot get enough throughput from the API even after increasing the number of Gunicorn workers and threads.
I have tried the following configurations:

1 worker:

```
gunicorn model:app -k uvicorn.workers.UvicornWorker -b hostname:port
```

This gives me a throughput of 15 responses/sec.

5 workers:

```
gunicorn model:app -k uvicorn.workers.UvicornWorker --workers=5 -b hostname:port
```

This gives me a throughput of 30 responses/sec.
30 responses/sec is the maximum throughput I can reach, but I need to scale to around 300 responses/sec. I also tried increasing the number of threads, but that did not increase throughput either.
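For reference, the threaded runs looked roughly like this (the thread count shown is illustrative; I tried several values):

```shell
# --threads value here is just an example; I experimented with a range of counts
gunicorn model:app -k uvicorn.workers.UvicornWorker --workers=5 --threads=2 -b hostname:port
```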
When I time a single request-response cycle with one worker, the response takes around 80 ms to return (measured through Postman).
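As a back-of-the-envelope sanity check on these numbers (my own arithmetic, assuming each request fully occupies one synchronous worker for the entire 80 ms latency window):

```python
# Rough throughput estimate.
# Assumption: each request ties up one synchronous worker
# for the full ~80 ms observed latency.
latency_s = 0.080                      # observed per-request latency
per_worker_rps = 1 / latency_s         # ~12.5 requests/sec per worker

target_rps = 300
workers_needed = target_rps / per_worker_rps

print(f"{per_worker_rps:.1f} req/s per worker, "
      f"~{workers_needed:.0f} fully parallel workers for {target_rps} req/s")
```

So the single-worker figure (~15/sec) is roughly in line with an 80 ms latency, but 5 workers should be closer to 60/sec than the 30/sec I observe, and ~300/sec would require on the order of 24 fully parallel workers.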
I am running this on a Linux machine with the following specs:
- OS - CentOS
- CPU(s) - 8
- Core(s) per socket - 4
- Thread(s) per core - 2
- Memory - ~65 GB
The system is almost idle while the service is running (less than 5% CPU usage).