First, I searched for this question before posting (I assumed people run into it frequently), but could not find an existing one. I have multiple images to process, and each image is processed by a sequence of kernels. For example:
md = true;
while(md) {
    kernel1<<<...>>>(image1, md);
    kernel2<<<...>>>(image1, md); // image1 here is the image modified by kernel1
    kernel3<<<...>>>(image1, md); // image1 here is the image modified by kernel2
}
md = true;
while(md) {
    kernel1<<<...>>>(imageN, md);
    kernel2<<<...>>>(imageN, md); // imageN here is the image modified by kernel1
    kernel3<<<...>>>(imageN, md); // imageN here is the image modified by kernel2
}
Processing for a particular image stops when md for that image is set to false by any kernel. The number of images is not fixed. Can I process the images in parallel using streams? If so, how will I know when one kernel in a stream has finished, so that I can launch the next kernel for that image? (Should I poll in an infinite while loop on the host?) I considered dynamic parallelism, but I am developing for CUDA compute capability 3.0, and dynamic parallelism requires 3.5. Thanks a lot for your time.
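To make the stream idea concrete, here is a minimal sketch of one way this could look, not a definitive implementation. It relies on the fact that operations issued to the same stream execute in issue order, so the host does not need to detect when kernel1 has finished before launching kernel2 for the same image. Everything named here is an assumption for illustration: `toy_kernel` stands in for kernel1/kernel2/kernel3, `process_images` is a hypothetical helper, the images are toy integer buffers, and md is assumed to live in device memory (one int per image) rather than being passed by value. Error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1/kernel2/kernel3: decrements every positive
// pixel and clears *md (in device memory) once a pixel reaches zero.
__global__ void toy_kernel(int* img, int n, int* md) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && img[idx] > 0) {
        img[idx] -= 1;
        if (img[idx] == 0) *md = 0;  // any thread of any kernel may request a stop
    }
}

// Runs the per-image while(md) loop for all images, one stream per image.
// Returns how many passes (kernel1..kernel3 sequences) each image needed.
std::vector<int> process_images(int num_images, int n) {
    std::vector<cudaStream_t> streams(num_images);
    std::vector<int*> d_img(num_images);
    std::vector<int> passes(num_images, 0);
    std::vector<bool> active(num_images, true);

    int* d_md = nullptr;  // one flag per image on the device
    int* h_md = nullptr;  // pinned host copy, so the async copies can overlap
    cudaMalloc(&d_md, num_images * sizeof(int));
    cudaMallocHost(&h_md, num_images * sizeof(int));

    for (int i = 0; i < num_images; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_img[i], n * sizeof(int));
        std::vector<int> host(n, 3 * (i + 1));  // toy data: image i holds 3*(i+1)
        cudaMemcpy(d_img[i], host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        h_md[i] = 1;  // md = true
    }
    cudaMemcpy(d_md, h_md, num_images * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    int remaining = num_images;

    while (remaining > 0) {
        // Enqueue one full pass for every image that is still active.
        // Within a stream the three kernels and the copy run in issue order.
        for (int i = 0; i < num_images; ++i) {
            if (!active[i]) continue;
            toy_kernel<<<blocks, threads, 0, streams[i]>>>(d_img[i], n, d_md + i);
            toy_kernel<<<blocks, threads, 0, streams[i]>>>(d_img[i], n, d_md + i);
            toy_kernel<<<blocks, threads, 0, streams[i]>>>(d_img[i], n, d_md + i);
            cudaMemcpyAsync(&h_md[i], d_md + i, sizeof(int),
                            cudaMemcpyDeviceToHost, streams[i]);
        }
        // Wait for each active stream and read back its flag.
        for (int i = 0; i < num_images; ++i) {
            if (!active[i]) continue;
            cudaStreamSynchronize(streams[i]);
            ++passes[i];
            if (h_md[i] == 0) {  // md was set to false for this image
                active[i] = false;
                --remaining;
            }
        }
    }

    for (int i = 0; i < num_images; ++i) {
        cudaFree(d_img[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(d_md);
    cudaFreeHost(h_md);
    return passes;
}
```

Calling `process_images(3, 1000)` from a host `main` would run three toy images concurrently. The host blocks in `cudaStreamSynchronize` once per pass instead of spinning in a busy loop, and none of this needs dynamic parallelism, so it should be usable on compute capability 3.0.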
Edited: According to the comment by VAnderi