I am writing a Python server application that serves Keras (i.e. TensorFlow 2) models to its clients. More precisely: a client sends an image to the server, the server calls Keras predict() on that image and returns the result of that model inference back to the client.
Multiple different Keras models will be available on the server. A client therefore sends not only the image to be analyzed, but also the type of that image, which tells the server which Keras model to use. For example, the client can send an image cat_or_dog.jpg together with a flag "animals". That way the server knows it should run predict() with the "animals" Keras model. Something like:
self.keras_models["animals"].predict(cat_or_dog.jpg)
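To make the setup concrete, here is a minimal sketch of the dispatch I have in mind (all names are hypothetical, and DummyModel just stands in for a loaded Keras model that would really take a preprocessed array):

```python
class DummyModel:
    """Stand-in for a loaded tf.keras model."""
    def __init__(self, name):
        self.name = name

    def predict(self, image_bytes):
        # A real Keras model would take a preprocessed numpy array,
        # not raw bytes; this is just to illustrate the dispatch.
        return f"{self.name}: {len(image_bytes)} bytes analyzed"


class ModelServer:
    def __init__(self):
        # One model per image type, loaded once at server startup.
        self.keras_models = {
            "animals": DummyModel("animals"),
            "plants": DummyModel("plants"),
        }

    def handle_request(self, image_type, image_bytes):
        # Pick the model based on the flag sent by the client.
        return self.keras_models[image_type].predict(image_bytes)


server = ModelServer()
print(server.handle_request("animals", b"fake-jpeg-data"))
```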
The server is a standard TCP server, i.e. all communication with the clients is done via TCP as well. It can serve multiple clients over multiple TCP connections, using multiple Python threads, so requests from clients can of course come in concurrently. Also note that Keras/TF model inference takes place on a GPU; the server has a Tesla T4 GPU built in.
So my questions: What do I need to watch out for when handling such multiple concurrent client requests? I am not talking about the TCP side of my application; that already works fine. But what about the Keras models, running in parallel on different threads but with only one GPU: do I need to "lock" that GPU somehow, or is this done automatically by Python/TF? Is a call to predict() atomic, or could it be interrupted by another thread? Moreover, what if multiple threads simultaneously call the server with the same model (e.g. models["animals"]): do I need to create a copy of the TF model for each thread? Any other things I need to watch out for?