I am running inference on a Hugging Face model using FastAPI and Uvicorn.
The code looks roughly like this:
from fastapi import FastAPI

app = FastAPI()

@app.post("/inference")
async def func(text: str):
    # blocking call into the Hugging Face pipeline
    output = huggingfacepipeline(text)
    return output
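For completeness, huggingfacepipeline is a module-level global created once at import time, roughly like this sketch (the task, model name, dtype, and device below are placeholders for whichever model I'm testing):

from transformers import pipeline
import torch

# placeholder model; in the problematic case this is GPT-J loaded in float16
huggingfacepipeline = pipeline(
    "text-generation",
    model="EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
    device=0,
)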
I start the server like this:
uvicorn app:app --host 0.0.0.0 --port 8080 --workers 4
The server has plenty of GPU memory (80 GB).
What I expect to happen is that each of the 4 workers is forked as its own process (one fork of the main process per worker), each loads its own copy of the model, and so each gets its own GPU memory space. I can check the GPU memory allocation using nvidia-smi, so there should be 4 CPU forks and 4 corresponding processes on the GPU.
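To see which workers actually end up on the GPU, I can log each worker's PID and CUDA allocation at startup, something like this sketch (it assumes torch is available and reuses the same app object from the snippet above):

import os
import torch

# same FastAPI app as above; each Uvicorn worker runs this hook in its own process
@app.on_event("startup")
async def log_worker_gpu_state():
    pid = os.getpid()
    if torch.cuda.is_available():
        allocated_gib = torch.cuda.memory_allocated(0) / 1024 ** 3
        print(f"worker pid={pid}: {allocated_gib:.2f} GiB allocated on cuda:0")
    else:
        print(f"worker pid={pid}: no CUDA device visible")

Each worker should print its own PID, which I can then match against the process list shown by nvidia-smi.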
This happens like clockwork when I use a smaller model (like GPT-Neo 125M).
But when I use a larger model (like GPT-J in 16-bit), the behavior is unpredictable: sometimes there are 4 CPU forks but only 3 processes on the GPU, even though there is enough free memory left over; sometimes there are 4 CPU forks but only 1 process on the GPU.
What could be causing this, and how do I diagnose it further?