I am running inference on a Hugging Face model using FastAPI and Uvicorn.
The code looks roughly like this:
from fastapi import FastAPI

app = FastAPI()

@app.post("/inference")
async def func(text: str):
    # blocking call into the Hugging Face pipeline
    output = huggingfacepipeline(text)
    return output
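For completeness, huggingfacepipeline is a module-level global created once at import time, roughly like this sketch (the task, model name, dtype, and device below are placeholders for whichever model I'm testing):

from transformers import pipeline
import torch

# placeholder model; in the problematic case this is GPT-J loaded in float16
huggingfacepipeline = pipeline(
    "text-generation",
    model="EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
    device=0,
)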
I start the server like this:
uvicorn app:app --host 0.0.0.0 --port 8080 --workers 4
The server has plenty of GPU memory (80 GB).
What I expect to happen is that each of the 4 workers is forked as its own process (one fork of the main process per worker), each loads its own copy of the model, and so each gets its own GPU memory space. I can check the GPU memory allocation using nvidia-smi, so there should be 4 CPU forks and 4 corresponding processes on the GPU.
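To see which workers actually end up on the GPU, I can log each worker's PID and CUDA allocation at startup, something like this sketch (it assumes torch is available and reuses the same app object from the snippet above):

import os
import torch

# same FastAPI app as above; each Uvicorn worker runs this hook in its own process
@app.on_event("startup")
async def log_worker_gpu_state():
    pid = os.getpid()
    if torch.cuda.is_available():
        allocated_gib = torch.cuda.memory_allocated(0) / 1024 ** 3
        print(f"worker pid={pid}: {allocated_gib:.2f} GiB allocated on cuda:0")
    else:
        print(f"worker pid={pid}: no CUDA device visible")

Each worker should print its own PID, which I can then match against the process list shown by nvidia-smi.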
This happens like clockwork when I use a smaller model (like GPT-Neo 125M).
But when I use a larger model (like GPT-J in 16-bit), the behavior is unpredictable: sometimes there are 4 CPU forks but only 3 processes on the GPU, even though there is enough free memory left over; sometimes there are 4 CPU forks but only 1 process on the GPU.
What could be causing this, and how do I diagnose it further?