
We are trying to decrease the latency of our BERT model prediction service, which is deployed using FastAPI. Predictions are served through the /predict endpoint. We looked into the tracing and found that one of the bottlenecks is the prometheus-fastapi-instrumentator. About 1% of requests time out because they exceed 10 s.

We also discovered that at 4 requests/second, some metrics stop being reported. Some requests took 30-50 seconds, with the Starlette/FastAPI layers accounting for most of that time. So it seems that under high load, the /metrics endpoint doesn't get enough resources, and all /metrics requests wait for some time and eventually fail. Running a separate container for metrics could help, or, if possible, metrics could be delayed/paused under high load. Any insight/guidance would be much appreciated.
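To illustrate the "separate container/port for metrics" idea: one common pattern is to serve /metrics from its own port in its own thread (or, better, its own process), so scrapes never compete with prediction requests for the main server's workers. This is a minimal stdlib sketch; `render_metrics` and `REQUEST_COUNT` are hypothetical stand-ins for the real Prometheus registry (with prometheus-fastapi-instrumentator you would instead use prometheus_client's multiprocess mode or a sidecar exporter).

```python
# Sketch: serve metrics on a separate port/thread so /metrics scrapes
# don't queue behind slow prediction requests on the main server.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical stand-in for real Prometheus counters


def render_metrics() -> bytes:
    # Hypothetical: snapshot app counters in Prometheus text format.
    return f"http_requests_total {REQUEST_COUNT}\n".encode()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging so scrapes stay cheap.
        pass


def start_metrics_server(port: int = 9100) -> HTTPServer:
    # Bind a dedicated port and serve it from a daemon thread,
    # independent of the FastAPI/uvicorn event loop.
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The same separation is what a metrics sidecar container gives you, just with process-level isolation instead of a thread.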

[screenshots of request traces]

Code Example:

This is a template I used to build my FastAPI prediction service. The only difference is that I'm using a BERT-based model instead of the simple model used in the template.

Riley Hun
  • The middlewares are chained like nesting dolls: each calls the next, the last one calls the path operation code and waits for it to return, and control then passes back through all the middlewares in reverse order. So I assume the problem is in the code of the path operation itself; could you share it in a minimally understandable version? – alex_noname Dec 06 '21 at 14:54
  • @alex_noname - added a template version – Riley Hun Dec 06 '21 at 18:12
  • It seems it was not added – alex_noname Dec 06 '21 at 18:23
  • Apologies - added now – Riley Hun Dec 06 '21 at 18:28
  • Did you estimate the pure `model.predict` execution time? – alex_noname Dec 06 '21 at 20:37
  • It varies depending on the length of the text, but it's less than 1 second – Riley Hun Dec 06 '21 at 20:46
  • First, I can advise trying to [use](https://fastapi.tiangolo.com/async/#path-operation-functions) `def predict` instead of `async def`. Then you can try to [use](https://stackoverflow.com/questions/63169865/how-to-do-multiprocessing-in-fastapi/63171013#63171013) multiprocessing for predict. And recheck your estimates. – alex_noname Dec 07 '21 at 04:18
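The point of the `def`-vs-`async def` advice above can be sketched outside FastAPI: a blocking call inside `async def` stalls the whole event loop, while a plain `def` endpoint is run by FastAPI in a threadpool, which behaves like offloading the call to a thread. In this sketch, `fake_predict` is a hypothetical stand-in for the blocking `model.predict`, and `asyncio.to_thread` stands in for FastAPI's threadpool offload:

```python
# Sketch: why a blocking predict inside `async def` serializes requests,
# while offloading it to a thread keeps concurrent requests overlapping.
import asyncio
import time


def fake_predict(text: str) -> str:
    # Hypothetical stand-in for a blocking model.predict (~0.2 s).
    time.sleep(0.2)
    return text.upper()


async def predict_blocking(text: str) -> str:
    # Blocks the event loop for the full call; concurrent requests queue up.
    return fake_predict(text)


async def predict_offloaded(text: str) -> str:
    # Runs the blocking call in a worker thread, like a plain `def`
    # endpoint in FastAPI; the event loop stays free for other requests.
    return await asyncio.to_thread(fake_predict, text)


async def serve_many(handler, n: int = 5) -> float:
    # Simulate n concurrent requests and return the total wall time.
    start = time.perf_counter()
    await asyncio.gather(*(handler(f"req{i}") for i in range(n)))
    return time.perf_counter() - start
```

With 5 concurrent requests, the blocking version takes roughly 5 × 0.2 s while the offloaded version takes roughly 0.2 s. For CPU-bound BERT inference the threadpool only helps up to GIL contention, which is why the comment also suggests multiprocessing.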

0 Answers