We are trying to decrease the latency of our BERT model prediction service, which is deployed using FastAPI. Predictions are served through the `/predict` endpoint. We looked into the tracing and found that one of the bottlenecks is the `prometheus-fastapi-instrumentator`. About 1% of requests time out because they exceed 10s.
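
For context, the instrumentation is set up in what I believe is the standard way (a simplified sketch of our setup): the middleware wraps every request, and `/metrics` is exposed on the same app, so scrapes share the event loop and workers with `/predict`.

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()  # the same app that serves /predict

# The middleware records latency/counters for every request, and /metrics
# is exposed on this app, so scrapes go through the same uvicorn workers
# and event loop as /predict.
Instrumentator().instrument(app).expose(app, endpoint="/metrics")
```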
We also discovered that some metrics stop being reported at around 4 requests/second. Some requests took 30-50 seconds, with the starlette/fastapi spans in the trace taking most of that time. So it seems that under high load, the `/metrics` endpoint doesn't get enough resources, and hence all `/metrics` requests wait for some time and eventually fail. Having a separate container for metrics could help, or, if possible, delaying/pausing metrics collection under high load. Any insight/guidance would be much appreciated.
Code Example:

This is a template I used to build my FastAPI prediction service. The only difference is that I'm using a BERT-based model instead of the simple model used in the template.
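
Stripped down, the service looks roughly like this (a minimal sketch, not the exact template; the `transformers` pipeline below is just a stand-in for however our fine-tuned BERT model is actually loaded):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from prometheus_fastapi_instrumentator import Instrumentator
from transformers import pipeline  # placeholder for our real model loading

app = FastAPI()

# Loaded once at import time; "bert-base-uncased" is a placeholder here.
classifier = pipeline("text-classification", model="bert-base-uncased")

class PredictionRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: PredictionRequest):
    # Plain `def` (not `async def`) so the blocking inference call runs in
    # the threadpool instead of blocking the event loop.
    return classifier(request.text)[0]

# Same instrumentation as in the snippet above.
Instrumentator().instrument(app).expose(app)
```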