I am writing a Python server application that serves Keras (i.e. TensorFlow 2) models to its clients. More precisely: a client sends an image to the server, the server calls Keras predict() on that image and returns the result of that model inference back to the client.
Multiple different Keras models will be available on the server. A client therefore sends not only the image to be analyzed, but also the type of that image, which tells the server which Keras model to use. For example, the client can send an image cat_or_dog.jpg together with a flag "animals". That way the server knows it should run predict() with the "animals" Keras model. Something like:
self.keras_models["animals"].predict(cat_or_dog.jpg)
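To make the setup concrete, here is a minimal sketch of the dispatch I have in mind (all names are hypothetical, and DummyModel just stands in for a loaded Keras model that would really take a preprocessed array):

```python
class DummyModel:
    """Stand-in for a loaded tf.keras model."""
    def __init__(self, name):
        self.name = name

    def predict(self, image_bytes):
        # A real Keras model would take a preprocessed numpy array,
        # not raw bytes; this is just to illustrate the dispatch.
        return f"{self.name}: {len(image_bytes)} bytes analyzed"


class ModelServer:
    def __init__(self):
        # One model per image type, loaded once at server startup.
        self.keras_models = {
            "animals": DummyModel("animals"),
            "plants": DummyModel("plants"),
        }

    def handle_request(self, image_type, image_bytes):
        # Pick the model based on the flag sent by the client.
        return self.keras_models[image_type].predict(image_bytes)


server = ModelServer()
print(server.handle_request("animals", b"fake-jpeg-data"))
```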
The server is a standard TCP server, i.e. all communication with the clients is done via TCP as well. It can serve multiple clients over multiple TCP connections, using multiple Python threads, so requests from clients can of course come in concurrently. Also note that Keras/TF model inference takes place on a GPU; the server has a Tesla T4 GPU built in.
So my questions: What do I need to watch out for when handling such multiple concurrent client requests? I am not talking about the TCP side of my application; that already works fine. But what about the Keras models, running in parallel on different threads but with only one GPU: do I need to "lock" that GPU somehow, or is this done automatically by Python/TF? Is a call to predict() atomic, or could it be interrupted by another thread? Moreover, what if multiple threads simultaneously call the server with the same model (e.g. models["animals"]): do I need to create a copy of the TF model for each thread? Any other things I need to watch out for?