
I have written a machine learning inference library with Python bindings. Under normal operation, this library uses 8 threads for inference and maxes out all 8 threads at 100%. This is the desired behavior: the model is very heavy and I need to optimize for low latency, so I need to use all available CPU resources.

If I write a Python script and call the inference function in this library in an infinite loop, the 8 threads are maxed out as expected (here's the output of the `htop` command).

Calling the library from a python script
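
For reference, the standalone script is essentially just a tight loop around the bound inference call. A minimal sketch, with `my_inference_lib` and `run_inference` as hypothetical stand-ins for the real bindings (which aren't shown here):

```python
# Minimal sketch of the standalone benchmark script.
# `my_inference_lib` and `run_inference` are hypothetical stand-ins
# for the real bindings.
import my_inference_lib

dummy_input = ...  # whatever the model expects

while True:
    # The library fans the work out across 8 threads internally.
    my_inference_lib.run_inference(dummy_input)
```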

Now here's where I'm having an issue. I need to call this machine learning library from within a FastAPI server I have written. I launch the FastAPI server from within my Docker container with the following command: `CMD uvicorn main:app --host 0.0.0.0 --port 8080`. As you can see, I use uvicorn.

Now, here is where things get interesting. If I call the same inference function, once again in an infinite loop, but this time from within one of my FastAPI endpoints, the CPU usage is capped at ~65% per thread and won't exceed that.

Calling the library from FastAPI
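
For illustration, the endpoint is structured roughly like the sketch below (again, `my_inference_lib` is a hypothetical stand-in; the real handler is not shown):

```python
# Hypothetical reconstruction of the FastAPI endpoint running the same loop.
from fastapi import FastAPI
import my_inference_lib

app = FastAPI()
dummy_input = ...  # whatever the model expects

@app.get("/benchmark")
def benchmark():
    # Same infinite inference loop as the standalone script,
    # but now running inside a FastAPI endpoint under uvicorn.
    while True:
        my_inference_lib.run_inference(dummy_input)
```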

Any thoughts on why the CPU usage is being capped? I would like it to reach 100% so I can make full use of the CPU; the cap is costing me inference performance.

cyrusbehr
  • My guess is that when you run under `uvicorn`, there are other threads running that suck time away from the inference threads. When other threads are introduced into an environment, this can happen quite easily due to the [GIL](https://realpython.com/python-gil). This is just a guess, as I don't know all the details of your setup. To get around this, it often makes sense to switch from a multi-threaded model to a multi-process model. In your case, you could possibly just spawn a separate process that runs your inference threads to decouple them from the main runtime environment. – CryptoFool Oct 27 '22 at 23:13
  • That's a good thought, I may test that out. However, the ML library I have written is in C++ and is thread-safe. Therefore, in the pybind11 Python bindings layer (where the C++ method is called), I release the Python GIL: `py::gil_scoped_release release;` – cyrusbehr Oct 27 '22 at 23:21
  • 1
    You need to provide a [mre] and debugging details. Try to make test modules such as burning the CPU with pure Python, with pure C extension, with pybind11 C extension, etc. I mean a simple loop such as ```a = 0; while True: a += 1``` – relent95 Oct 28 '22 at 05:27
  • A solution (which is usually the preferred way as soon as the service starts getting more load) would be to move the ML part out into its own process and not run it inside the uvicorn/fastapi process hierarchy. Instead, use a queue: put a request on the queue when it arrives, pop it off the queue in your ML worker (which would be a separate set of processes), and then return the result to the caller through the queueing system (or out of band through redis/a database/etc.). That lets you scale the two parts of the system independently as needed. – MatsLindh Oct 28 '22 at 07:50
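
To illustrate the pattern suggested in the last comment, here is a minimal sketch of a dedicated inference worker process fed by a queue, using the standard library's `multiprocessing` and the same hypothetical `my_inference_lib` names as above (not code from the actual project):

```python
# Minimal sketch of a separate inference worker process fed by a queue.
# `my_inference_lib` is a hypothetical stand-in for the real bindings.
import multiprocessing as mp

import my_inference_lib

def inference_worker(requests_q, results_q):
    # Runs in its own process, fully decoupled from the web server.
    while True:
        request_id, payload = requests_q.get()
        output = my_inference_lib.run_inference(payload)
        results_q.put((request_id, output))

if __name__ == "__main__":
    requests_q = mp.Queue()
    results_q = mp.Queue()
    worker = mp.Process(target=inference_worker, args=(requests_q, results_q), daemon=True)
    worker.start()

    # The web process would put work on `requests_q` and read from `results_q`;
    # here a single dummy item demonstrates the round trip.
    requests_q.put((1, "dummy payload"))
    print(results_q.get())
```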

1 Answer


I was able to determine what the issue was.

I had defined my endpoints with a plain `def`, and according to the FastAPI documentation, such endpoints are dispatched to an external threadpool. This must have caused some CPU contention, presumably related to the Python GIL. Switching the endpoints to `async def` solved the problem and allowed each of the 8 threads to reach 100% CPU usage, which reduced the latency by about 30%.
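
In terms of the hypothetical endpoint sketched in the question, the change amounts to swapping `def` for `async def` on the handler:

```python
# Sketch of the fix; `my_inference_lib` is still a hypothetical stand-in.
from fastapi import FastAPI
import my_inference_lib

app = FastAPI()
dummy_input = ...  # whatever the model expects

# Before (capped at ~65% per thread): a plain `def` handler is dispatched
# to FastAPI's external threadpool, which contends with the inference threads.
#
# @app.get("/benchmark")
# def benchmark():
#     while True:
#         my_inference_lib.run_inference(dummy_input)

# After (all 8 threads reach 100%): an `async def` handler runs directly on
# the event loop, so no threadpool worker competes for the GIL while the
# C++ layer does the work.
@app.get("/benchmark")
async def benchmark():
    while True:
        my_inference_lib.run_inference(dummy_input)
```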

cyrusbehr