I'm trying to implement an efficient way of doing concurrent inference in PyTorch.
Right now, I start 2 processes on my GPU (I have only 1 GPU, so both processes run on the same device). Each process loads my PyTorch model and runs the inference step.
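For context, a simplified version of my current setup looks like this (nn.Linear is just a small stand-in for my actual ~3 GB model):

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(rank):
    # Each process builds its own copy of the model on the single GPU
    # (nn.Linear stands in for my real ~3 GB model).
    model = nn.Linear(1024, 1024).cuda().eval()
    with torch.no_grad():
        x = torch.randn(8, 1024, device="cuda")
        y = model(x)
    print(f"process {rank}: output shape {tuple(y.shape)}")

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)
```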
My problem is that the model takes up quite a lot of memory. I have 12 GB of memory on the GPU, and the model alone takes ~3 GB (without the data), which means my 2 processes together take 6 GB just for the model.
Now I was wondering if it's possible to load the model only once and use it for inference in 2 different processes. What I want is for the model to consume only 3 GB of memory while still having 2 processes.
I came across this answer mentioning IPC, but as far as I understood it, process #2 would copy the model from process #1, so I would still end up with 6 GB allocated for the model.
I also checked the PyTorch documentation on DataParallel and DistributedDataParallel, but it doesn't seem possible: both are designed to replicate the model per process/device rather than share a single copy.
This seems to be what I want, but I couldn't find any code example of how to use it with PyTorch in inference mode.
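My best guess at what this would look like is the sketch below (again with nn.Linear standing in for my real model): the model is loaded once in the parent and passed to the workers through torch.multiprocessing, which, as I understand it, serializes CUDA storages as IPC handles. What I don't know is whether the workers then actually reference the parent's 3 GB allocation or each receive their own copy.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(rank, model):
    # The model is created once in the parent and handed to each worker;
    # torch.multiprocessing passes CUDA storages via CUDA IPC.
    with torch.no_grad():
        x = torch.randn(8, 1024, device="cuda")
        y = model(x)
    print(f"process {rank}: output shape {tuple(y.shape)}")

if __name__ == "__main__":
    # Load the model a single time in the parent process.
    model = nn.Linear(1024, 1024).cuda().eval()
    mp.spawn(worker, args=(model,), nprocs=2)
```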
I understand it might be difficult to do such a thing for training, but note that I'm only talking about the inference step (the model is read-only, with no need to update gradients). Under this assumption, I'm not sure whether it's possible or not.