
First, I searched for this question before posting (I assumed people run into it frequently) but could not find it. I have multiple images to process, and the processing is done across several kernels. For example:

md = true;
while(md) {
    kernel1<<<...>>>(image1, md);
    kernel2<<<...>>>(image1, md); //image1 here is the image modified by kernel1
    kernel3<<<...>>>(image1, md); //image1 here is the image modified by kernel2
}

md = true;
while(md) {
    kernel1<<<...>>>(imageN, md);
    kernel2<<<...>>>(imageN, md); //imageN here is the image modified by kernel1
    kernel3<<<...>>>(imageN, md); //imageN here is the image modified by kernel2
}

The processing for a particular image stops when md for that image is set to false by any kernel. The number of images is not fixed. Can I process the images in parallel using streams? If so, how will I know when one kernel belonging to a stream has finished, so that I can invoke the next kernel for that particular image? (Should I put it in an infinite while loop on the host?) I was thinking of dynamic parallelism, but I am developing for CUDA compute capability 3.0. Thanks a lot for your time.

Edited: According to the comment by VAndrei

  • So kernel1 changes image1 and then kernel2 changes the changed image1 for kernel3? – VAndrei Oct 14 '14 at 11:27
  • Yes, kernel1 changes image1, the resulting image1 is given to kernel2 and then the resulting image1 is given to kernel3. – Andrew Mathews Oct 14 '14 at 11:34
  • You could use a CPU parallel threading model, like OpenMP, and create one stream for each OMP thread. Place one while loop in each OMP thread, and have the while loops individually draw new images to be processed from a queue. I'd be very surprised if you get much performance improvement this way, unless your kernels are trivially small. – Robert Crovella Oct 16 '14 at 02:42
  • Sorry for the delay in replying; I was stuck fine-tuning the algorithm itself, which had nothing to do with CUDA. Why do you say "I'd be very surprised if you get much performance improvement this way, unless your kernels are trivially small"? What is the reason for it? Each of my images has either 230x230 pixels or 16384x7 pixels. So processing multiple images in parallel should give me a speedup, right? (Is there no way to do it without using OpenMP?) – Andrew Mathews Oct 22 '14 at 05:45

1 Answer


I think you can use CUDA streams for this task, and it should pay off if you have multiple images.

For example, you can create 2 streams, one that processes odd-numbered images and one that processes even-numbered images. In each stream you "enqueue" kernel1, kernel2 and kernel3; this guarantees that kernel2 waits for kernel1, and so on. See this presentation.

The stream behaves like a queue. If you push the kernels into the stream, they will run in the order you enqueued them. See this post for more information.
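The in-order behaviour of a stream can be sketched as follows. This is only a minimal illustration, not the asker's actual code: the three kernel bodies, the function name `run_two_stream_demo`, and the launch configuration are all assumptions made up for the example. Each image starts at zero and goes through add 1, multiply by 2, subtract 3, so every pixel ends at -1 only if the kernels ran in the enqueued order.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for kernel1, kernel2 and kernel3 from the question.
__global__ void kernel1(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] += 1.0f;
}
__global__ void kernel2(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] *= 2.0f;
}
__global__ void kernel3(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] -= 3.0f;
}

// Processes numImages images of n pixels each on two streams
// (even-numbered images on one, odd-numbered images on the other).
// Returns true if every pixel ends up as (0 + 1) * 2 - 3 = -1.
bool run_two_stream_demo(int numImages, int n) {
    const int numStreams = 2;
    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    std::vector<float*> d_images(numImages);
    for (int k = 0; k < numImages; ++k) {
        cudaMalloc(&d_images[k], n * sizeof(float));
        cudaMemset(d_images[k], 0, n * sizeof(float));
    }
    cudaDeviceSynchronize();  // make sure initialisation is finished

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int k = 0; k < numImages; ++k) {
        cudaStream_t st = streams[k % numStreams];
        // Within one stream these run strictly in the order enqueued.
        kernel1<<<blocks, threads, 0, st>>>(d_images[k], n);
        kernel2<<<blocks, threads, 0, st>>>(d_images[k], n);
        kernel3<<<blocks, threads, 0, st>>>(d_images[k], n);
    }
    cudaDeviceSynchronize();  // wait for both streams to drain

    bool ok = true;
    std::vector<float> h(n);
    for (int k = 0; k < numImages; ++k) {
        cudaMemcpy(h.data(), d_images[k], n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) ok = ok && (h[i] == -1.0f);
        cudaFree(d_images[k]);
    }
    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    return ok;
}
```

The host loop returns immediately after each launch; only the final `cudaDeviceSynchronize` blocks. Whether kernels from the two streams actually run concurrently depends on the device and on how much of the GPU each kernel occupies.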

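I don't recommend putting kernel1, kernel2 and kernel3 on different streams: they depend on each other's output, so you would have to add explicit synchronization between the streams and would gain nothing.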

Regarding dynamic parallelism: it requires compute capability 3.5 or higher, so it is not available on your device. Streams, however, also let you overlap memory copies with kernels working on another data set. You could squeeze out more performance by copying the next set of images while the kernels process the current one.
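Overlapping the copy of one image with the kernels working on another could look roughly like the sketch below, which uses `cudaMemcpyAsync` together with pinned host memory allocated by `cudaMallocHost`. The kernel `add_one`, the function name `run_overlap_demo`, and the two-buffer layout are assumptions made for illustration only.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder for the real per-image processing.
__global__ void add_one(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] += 1.0f;
}

// Uploads, processes and downloads numImages images of n pixels each,
// alternating between two streams so that a copy in one stream can
// overlap with a kernel in the other. Image k is filled with the value k;
// returns true if every pixel of image k comes back as k + 1.
bool run_overlap_demo(int numImages, int n) {
    const int numStreams = 2;
    const size_t bytes = n * sizeof(float);
    cudaStream_t streams[numStreams];
    float* d_buf[numStreams];
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_buf[s], bytes);
    }

    // Asynchronous copies only overlap with kernels when the host
    // memory is pinned (page-locked).
    float* h_images = nullptr;
    cudaMallocHost(&h_images, numImages * bytes);
    for (int k = 0; k < numImages; ++k)
        for (int i = 0; i < n; ++i)
            h_images[(size_t)k * n + i] = (float)k;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int k = 0; k < numImages; ++k) {
        int s = k % numStreams;
        float* h = h_images + (size_t)k * n;
        // Copy in, process, copy out: ordered within stream s, so
        // reusing d_buf[s] for the next image on that stream is safe.
        cudaMemcpyAsync(d_buf[s], h, bytes, cudaMemcpyHostToDevice, streams[s]);
        add_one<<<blocks, threads, 0, streams[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h, d_buf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    bool ok = true;
    for (int k = 0; k < numImages; ++k)
        for (int i = 0; i < n; ++i)
            ok = ok && (h_images[(size_t)k * n + i] == (float)(k + 1));

    cudaFreeHost(h_images);
    for (int s = 0; s < numStreams; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}
```

How much overlap you actually get depends on the number of copy engines on the device and on the relative cost of the copies and the kernels.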

VAndrei
  • Thank you. Will look into the presentation. But it's not yet clear to me how to do it without an infinite loop. Suppose I launch 2 streams, one for each image. The CPU should invoke kernel1, kernel2, kernel3 many times depending on the value of md. If it were only a one-time invocation, I would understand this. – Andrew Mathews Oct 14 '14 at 11:40
  • Depending on the value of md, I decide whether to refine the image further or not. [link](http://stackoverflow.com/questions/26324059/process-an-image-multiple-times-in-cuda) – Andrew Mathews Oct 14 '14 at 12:05
  • Your approach is ok then. If you are doing some sort of filtering, we are not really talking about an infinite loop. It will probably take a small number of iterations. However, make sure you make each thread launch "blocking". – VAndrei Oct 14 '14 at 12:19
  • Thank you. How do I check for the completion of a stream in the while loop? – Andrew Mathews Oct 14 '14 at 12:33
  • cudaStreamSynchronize function could help. See more info here: http://stackoverflow.com/questions/5107265/get-rid-of-busy-waiting-during-asynchronous-cuda-stream-executions – VAndrei Oct 14 '14 at 12:51
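One way the host-side while loop with `cudaStreamSynchronize` might look, using a single host thread and one stream per image, is sketched below. Everything named here is hypothetical: `refine_step` stands in for the whole kernel1/kernel2/kernel3 sequence, md is modelled as an `int` flag in device memory that any thread may clear, and image k is simply filled with the value k + 1 so that it needs k + 1 passes.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the kernel1..kernel3 sequence: lowers every
// pixel by 1 and clears the md flag once a pixel reaches zero or below.
__global__ void refine_step(float* img, int n, int* md) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        img[i] -= 1.0f;
        if (img[i] <= 0.0f) *md = 0;  // all writers store the same value
    }
}

// Runs the refinement loop for numImages images of n pixels each and
// returns how many iterations each image needed before md became 0.
std::vector<int> process_until_done(int numImages, int n) {
    std::vector<cudaStream_t> streams(numImages);
    std::vector<float*> d_img(numImages);
    std::vector<int*> d_md(numImages);
    std::vector<int> iterations(numImages, 0);
    std::vector<bool> active(numImages, true);

    int* h_md = nullptr;  // pinned, so the flag copies can be asynchronous
    cudaMallocHost(&h_md, numImages * sizeof(int));

    std::vector<float> init(n);
    for (int k = 0; k < numImages; ++k) {
        cudaStreamCreate(&streams[k]);
        cudaMalloc(&d_img[k], n * sizeof(float));
        cudaMalloc(&d_md[k], sizeof(int));
        init.assign(n, (float)(k + 1));
        cudaMemcpy(d_img[k], init.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int remaining = numImages;
    while (remaining > 0) {
        // Enqueue one pass for every unfinished image; none of these block.
        for (int k = 0; k < numImages; ++k) {
            if (!active[k]) continue;
            h_md[k] = 1;  // md = true before the pass
            cudaMemcpyAsync(d_md[k], &h_md[k], sizeof(int),
                            cudaMemcpyHostToDevice, streams[k]);
            refine_step<<<blocks, threads, 0, streams[k]>>>(d_img[k], n, d_md[k]);
            cudaMemcpyAsync(&h_md[k], d_md[k], sizeof(int),
                            cudaMemcpyDeviceToHost, streams[k]);
        }
        // Wait per stream, then read the flag that was copied back.
        for (int k = 0; k < numImages; ++k) {
            if (!active[k]) continue;
            cudaStreamSynchronize(streams[k]);
            ++iterations[k];
            if (h_md[k] == 0) { active[k] = false; --remaining; }
        }
    }

    for (int k = 0; k < numImages; ++k) {
        cudaFree(d_img[k]);
        cudaFree(d_md[k]);
        cudaStreamDestroy(streams[k]);
    }
    cudaFreeHost(h_md);
    return iterations;
}
```

Because all passes are enqueued before the first `cudaStreamSynchronize`, work for different images can overlap on the GPU while the host waits. A non-blocking variant could poll each stream with `cudaStreamQuery` instead of waiting on them in order.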