
I'm trying to implement an efficient way of doing concurrent inference in PyTorch.

Right now, I start 2 processes on my GPU (I have only 1 GPU, so both processes run on the same device). Each process loads my PyTorch model and runs the inference step.

My problem is that my model takes up quite a lot of memory. I have 12 GB of memory on the GPU, and the model alone takes ~3 GB (without the data), which means my 2 processes together take 6 GB of memory just for the model.
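
Concretely, my current setup looks roughly like this (`MyModel`, `get_next_batch` and `send_result` are placeholders for my actual code):

```python
# Sketch of my current setup: each worker process loads its own copy of the
# weights onto the same GPU, so the ~3 GB of parameters is allocated twice.
import torch
import torch.multiprocessing as mp


def worker(rank, checkpoint_path):
    device = torch.device("cuda:0")                # both workers use the single GPU
    model = MyModel()                              # placeholder model class
    model.load_state_dict(torch.load(checkpoint_path, map_location=device))
    model.to(device).eval()
    with torch.no_grad():
        while True:
            batch = get_next_batch(rank)           # placeholder data-fetching helper
            output = model(batch.to(device))
            send_result(rank, output)              # placeholder result handler


if __name__ == "__main__":
    mp.set_start_method("spawn")                   # required for CUDA in subprocesses
    procs = [mp.Process(target=worker, args=(r, "model.pt")) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```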


Now I'm wondering whether it's possible to load the model only once and use it for inference from 2 different processes. What I want is for the model to consume only 3 GB of memory while still having 2 processes.


I came across this answer mentioning IPC, but as far as I understood, it means process #2 will copy the model from process #1, so I would still end up with 6 GB allocated for the model.

I also checked the PyTorch documentation on DataParallel and DistributedDataParallel, but it doesn't seem possible with these.

This seems to be what I want, but I couldn't find any code example showing how to use it with PyTorch in inference mode.


I understand it might be difficult to do such a thing for training, but please note I'm only talking about the inference step (the model is read-only, no need to update gradients). With this assumption, I'm not sure whether it's possible or not.

talonmies
Astariul
  • I don't see why you cannot just use the same (read-only) model for your inference. You can pass different data batches into the same model, and the data loading and inference can run in parallel. Multiple users can also talk to the model through a higher-level interface. Where are the bottlenecks that cause you to use two processes? – THN Feb 05 '20 at 07:43
  • Thanks for your comment @THN. I currently start my 2 processes, load the model in each of them, and infer. Since processes cannot share memory, how would you do it? Using threads? – Astariul Feb 05 '20 at 08:20
  • 1
    I would use one process to load one model and do inference. That will work for most purposes. What exactly do you want to achieve? – THN Feb 05 '20 at 08:54
  • I use several processes to achieve better concurrency. The issue is that with my current approach (1 model per process), several models are loaded onto a single GPU, wasting memory. I wonder if it's possible to load the model once and use it from several processes. – Astariul Feb 05 '20 at 09:00
  • 1
    You can get most of the benefit of concurrency with a single model on a single process, by doing the concurrency in data loading (which is separated from the model running process, this can be done manually; `tensorflow` has native support for optimal parallel data preloading, you can look into it for an example) and processing (automatically by larger batch). – THN Feb 05 '20 at 09:34
  • What if you run such a process and then `fork` it into two different processes, each of which acts as a server and listens on a different socket? Would that work? The problem is that by sharing GPU memory you'd have to synchronize the two processes so they don't use the (same) GPU memory at the same time. – Raz Rotenberg Feb 05 '20 at 14:57
  • How are you structuring this that two processes would offer better concurrency than one? An individual pytorch model running on a GPU is already highly concurrent, utilizing thousands of GPU threads. – nairbv Feb 05 '20 at 15:16
  • @nairbv Yes, it uses GPU threads and is concurrent for 1 inference, but I want to run 2 inferences at the same time. In that case, 2 processes offer the possibility of running 2 inferences simultaneously. – Astariul Feb 05 '20 at 23:33
  • 1
    @THN I didn't know `you get most of the benefit of concurrency with a single model on a single process`. I thought that, if memory allows it, it's more efficient to load 2 processes, so they can run in parallel. Please post an answer ! – Astariul Feb 05 '20 at 23:41

2 Answers


The GPU itself has many threads. When performing an array/tensor operation, it uses each thread on one or more cells of the array. This is why it seems that an op that can fully utilize the GPU should scale efficiently without multiple processes -- a single GPU kernel is already massively parallelized.

In a comment you mentioned seeing better results with multiple processes in a small benchmark. I'd suggest running the benchmark with more jobs to ensure warm-up; ten kernels seems like too small a test. If a thorough, representative benchmark consistently runs faster with multiple processes, though, I'll trust good benchmarks over my intuition.
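
Something along these lines (a sketch only; `model` and `batch` stand in for whatever you're benchmarking) avoids timing the warm-up and accounts for asynchronous kernel launches:

```python
import time
import torch


def benchmark(model, batch, warmup=20, iters=200):
    """Average per-iteration latency, excluding warm-up."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):            # warm-up: cuDNN autotuning, allocator, caches
            model(batch)
        torch.cuda.synchronize()           # kernels launch asynchronously; sync before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()           # make sure all queued work has finished
        return (time.perf_counter() - start) / iters
```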

My understanding is that kernels launched on the default CUDA stream are executed sequentially. If you want them to run in parallel, I think you'd need multiple streams. Looking in the PyTorch code, I see code like `getCurrentCUDAStream()` in the kernels, which makes me think the GPU will still run PyTorch code from all processes sequentially.

This NVIDIA discussion suggests this is correct:

https://devtalk.nvidia.com/default/topic/1028054/how-to-launch-cuda-kernel-in-different-processes/

Newer GPUs may be able to run multiple kernels in parallel (using MPS?), but it seems like this is just implemented with time-slicing under the hood anyway, so I'm not sure we should expect higher total throughput:

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

If you do need to share memory from one model across two parallel inference calls, can you just use multiple threads instead of processes, and refer to the same model from both threads?
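
For example, something like this (a sketch; `your_model`, `batches_a` and `batches_b` stand in for your own objects) keeps a single ~3 GB copy of the weights while letting two threads submit work:

```python
import threading
import torch

# One model instance, loaded once; sharing it for forward-only use is generally
# fine as long as nothing mutates its state (eval() avoids e.g. batch-norm updates).
model = your_model.to("cuda").eval()


def infer(batches, results):
    with torch.no_grad():
        for batch in batches:
            results.append(model(batch.to("cuda")))


results_a, results_b = [], []
t1 = threading.Thread(target=infer, args=(batches_a, results_a))
t2 = threading.Thread(target=infer, args=(batches_b, results_b))
t1.start(); t2.start()
t1.join(); t2.join()
```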

To actually get the GPU to run multiple kernels in parallel, you may be able to use `nn.parallel` in PyTorch. See the discussion here: https://discuss.pytorch.org/t/how-can-l-run-two-blocks-in-parallel/61618/3
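
If you experiment with streams directly, a rough sketch looks like the following (`model_a`, `model_b` and the inputs are placeholders); whether the kernels actually overlap depends on how much of the GPU each one already occupies:

```python
import torch

# Queue work on two separate CUDA streams so the GPU may overlap the kernels.
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

with torch.no_grad():
    with torch.cuda.stream(s1):
        out1 = model_a(input1)             # kernels enqueued on stream s1
    with torch.cuda.stream(s2):
        out2 = model_b(input2)             # kernels enqueued on stream s2

torch.cuda.synchronize()                   # wait for both streams before using the outputs
```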

nairbv

You can get most of the benefit of concurrency with a single model in a single process for (read-only) inference, by making the data loading and the model inference concurrent.

Data loading is separate from the model-running process; this can be done manually. As far as I know, `tensorflow` has native support for optimal parallel data preloading, which you can look into for an example.

Model inference is automatically parallel on GPU. You can maximize this concurrency by using larger batches.
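
As a sketch of the same idea in PyTorch (the `dataset`, `model` and `handle` names are placeholders), a `DataLoader` with worker processes prepares the next batches on the CPU while the GPU runs inference on the current one:

```python
import torch
from torch.utils.data import DataLoader

# Worker processes load and collate batches in the background; pinned memory
# plus non_blocking copies overlap the host-to-GPU transfer with compute.
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

model = model.to("cuda").eval()
with torch.no_grad():
    for batch in loader:
        outputs = model(batch.to("cuda", non_blocking=True))
        handle(outputs)                    # placeholder for whatever consumes the results
```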

From an architectural point of view, multiple users can also talk to the model through a higher level interface.
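
One possible shape for such an interface, purely as an illustration (every name here is made up): a single thread owns the model, pulls requests from a queue, batches whatever has accumulated, and sends each caller its result.

```python
import queue
import threading
import torch

request_q = queue.Queue()                          # (input_tensor, reply_queue) pairs


def serve(model):
    model = model.to("cuda").eval()
    with torch.no_grad():
        while True:
            tensor, reply_q = request_q.get()      # block until at least one request
            batch, replies = [tensor], [reply_q]
            while not request_q.empty() and len(batch) < 64:
                t, r = request_q.get_nowait()      # this thread is the only consumer
                batch.append(t)
                replies.append(r)
            # Assumes all requests share the same input shape so they can be stacked.
            outputs = model(torch.stack(batch).to("cuda"))
            for out, r in zip(outputs, replies):
                r.put(out.cpu())


def client_call(tensor):
    reply_q = queue.Queue(maxsize=1)
    request_q.put((tensor, reply_q))
    return reply_q.get()                           # caller blocks until its result is ready


# my_model is a placeholder for the actual loaded model.
threading.Thread(target=serve, args=(my_model,), daemon=True).start()
```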

THN
  • I'm wondering how the following case would be handled: inference takes 2 seconds, and 2 users request inference almost at the same time. Request #1 is processed for 2 seconds, then request #2 is processed for 2 seconds, so user #2 had to wait 4 seconds for their request. Isn't it better, in this case, to have 2 processes on the GPU? Then user #2's request takes just 2 seconds since a process is available. – Astariul Feb 06 '20 at 06:39
  • You should look at the job scheduling problem, which is well studied in operating systems and has several algorithms. In practice, jobs do not arrive at the same time, so you can process one job while loading another. If necessary, you can batch jobs together, process them in sequence if the waiting time is negligible, or divide a job if it is too large. – THN Feb 07 '20 at 00:02
  • I did some benchmarking for my specific case: if 10 clients request a prediction, it takes 0.96 s to serve all of them with 2 processes on the same GPU. The same experiment with only a single process takes 1.42 s. – Astariul Feb 10 '20 at 07:29
  • It's good that you actually tested, but note that each result is an anecdote. If all requests come at the same time, they only consume a negligible part of the GPU, and you process each request separately, then using 2 or more processes would certainly be faster. But there are cases where one process is good enough, such as when requests arrive at random, and cases where one process is better, such as when the model is large and the requests can be batched together. In the end you need to look at your own typical use case, find the bottlenecks, and decide where to optimize. – THN Feb 10 '20 at 10:40
  • Using multiple CPU processes to read requests, load data, and batch them together, then running them in one GPU process, is essentially the same as your original question about sharing memory (which is really the model parameters) on the GPU. You still need to do some work for it. – THN Feb 10 '20 at 10:54
  • FYI PyTorch also has parallel data loaders – nairbv Feb 10 '20 at 14:56
  • @nairbv that's good to hear. Does this optimize the data prefetch pipeline? Please link to the documentation/tutorials if you know of any. – THN Feb 12 '20 at 04:30