
I'm currently using FastAPI with Gunicorn/Uvicorn as my server engine. Inside a FastAPI GET endpoint I'm using a SentenceTransformer model on the GPU:

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer
import uvicorn

# ...

# model_name is defined elsewhere
encoding_model = SentenceTransformer(model_name, device='cuda')

# ...

app = FastAPI()

@app.get("/search/")
def encode(query: str):
    return encoding_model.encode(query).tolist()

# ...

def main():
    uvicorn.run(app, host="127.0.0.1", port=8000)


if __name__ == "__main__":
    main()

I'm using the following config for Gunicorn:

TIMEOUT 0
GRACEFUL_TIMEOUT 120
KEEP_ALIVE 5
WORKERS 10
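
For reference, these environment-style variables correspond roughly to the following Gunicorn settings file (a sketch; the UvicornWorker worker class is an assumption, as used by the common FastAPI Docker images):

# gunicorn.conf.py (sketch)
worker_class = "uvicorn.workers.UvicornWorker"
workers = 10             # ten independent worker processes
timeout = 0
graceful_timeout = 120
keepalive = 5
bind = "0.0.0.0:8000"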

Uvicorn has all default settings and is started in the Docker container in the usual way:

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

So, inside the Docker container I have 10 Gunicorn workers, each using the GPU.

The problem: after some load, my API fails with the following message:

torch.cuda.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 734.00 MiB 
(GPU 0; 15.74 GiB total capacity; 
11.44 GiB already allocated; 
189.56 MiB free; 
11.47 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    The error you posted clearly states the problem, i.e., _"Tried to allocate 734.00 MiB..."_ but _"189.56 MiB free;"_. As described in [this answer](https://stackoverflow.com/a/71517830/17865804) and [this answer](https://stackoverflow.com/a/71613757/17865804), workers do not share the same memory, and hence, each worker will load its own instance of the ML model (as well as other variables in your code) into memory. If you are using 10 workers, the model will end up being loaded 10 times into RAM. Have a look at the links above for more details and solutions. – Chris Jan 09 '23 at 17:40
    @Chris You are right. It helped. I used Celery as an RPC manager (RabbitMQ + Redis backend setup) and a separate container for GPU-bound computations, so there is only one instance of my model on the GPU. – Nick Zorander Feb 16 '23 at 08:48
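
As the comments point out, each of the 10 worker processes loads its own copy of the model onto the GPU. One way to see how that adds up against a ~16 GiB card is to measure a single copy's footprint and multiply by the worker count (a rough sketch; "your-model-name" is a placeholder, and inference-time activations come on top of this figure):

import torch
from sentence_transformers import SentenceTransformer

# Load a single copy of the model on the GPU and measure its footprint.
model = SentenceTransformer("your-model-name", device="cuda")
per_copy_mib = torch.cuda.memory_allocated() / 1024**2

print(f"one copy:   ~{per_copy_mib:.0f} MiB")
# compare with the "11.44 GiB already allocated" in the error message
print(f"10 workers: ~{10 * per_copy_mib:.0f} MiB")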

1 Answer


The problem was that there were 10 replicas of my transformer model on the GPU, as @Chris mentioned above. My solution was to use Celery as an RPC manager (RabbitMQ + Redis backend setup) and a separate container for the GPU-bound computations, so now there is only one instance of my model on the GPU and no contention between different processes' models.

Nick Zorander
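
A minimal sketch of the split described in this answer, with the model living in a single Celery worker inside the GPU container (the broker/backend URLs, model name, and task module are assumptions, not the exact production setup):

# tasks.py -- runs in the GPU container
from celery import Celery
from sentence_transformers import SentenceTransformer

celery_app = Celery(
    "encoder",
    broker="amqp://guest:guest@rabbitmq:5672//",
    backend="redis://redis:6379/0",
)

# Loaded once; start the worker with `celery -A tasks worker --pool=solo`
# so only a single copy of the model sits on the GPU.
encoding_model = SentenceTransformer("your-model-name", device="cuda")

@celery_app.task
def encode(query: str) -> list:
    return encoding_model.encode(query).tolist()


# main.py -- runs in the (CPU-only) API container; it never imports the model
from celery import Celery
from fastapi import FastAPI

celery_app = Celery(
    "encoder",
    broker="amqp://guest:guest@rabbitmq:5672//",
    backend="redis://redis:6379/0",
)

app = FastAPI()

@app.get("/search/")
def search(query: str):
    # Dispatch by task name so the API container needs no GPU dependencies,
    # then block until the GPU worker returns the embedding.
    return celery_app.send_task("tasks.encode", args=[query]).get(timeout=30)

With this split, the FastAPI workers stay CPU-only and the broker serializes requests to the single GPU-bound worker, so GPU memory use no longer scales with the number of web workers.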