I would like to understand how to convert basic C/C++ loops into a CUDA kernel. To keep it simple:
for (int i = 0; i < MAXi; i++) {
    for (int j = 0; j < MAXj; j++) {
        // ...code that uses i and j...
    }
}
For every value of i, MAXj elements need to be computed. This may be very basic for some people, but I am really struggling with it. Let's say MAXj is around a million, MAXj = 1000000; that is the dimension where we want all the threads to work. So far I have only been successful at parallelizing the inner loop, using a global thread index like:
int tid = threadIdx.x + blockDim.x * blockIdx.x + blockDim.x * gridDim.x * blockIdx.y;
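For reference, the working inner-loop version looks roughly like this (the kernel name, the out/d_out array, the dummy body, and the grid/block sizes are just placeholders for my real code):

__global__ void innerLoopKernel(float *out, int i, int MAXj)
{
    // Global thread index over a 2D grid of 1D blocks, same formula as above.
    int tid = threadIdx.x + blockDim.x * blockIdx.x + blockDim.x * gridDim.x * blockIdx.y;
    if (tid < MAXj) {
        // ...code that uses i and tid (tid plays the role of j)...
        out[(size_t)i * MAXj + tid] = (float)(i + tid);  // dummy work
    }
}

// Host side (sketch): the outer loop stays on the CPU and launches the kernel once per i.
dim3 block(256);
dim3 grid(2048, 2);   // 2048 * 2 * 256 = 1,048,576 threads >= MAXj
for (int i = 0; i < MAXi; i++)
    innerLoopKernel<<<grid, block>>>(d_out, i, MAXj);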
Using 2D blocks, how can I parallelize this kind of nested loop? Such loops are very common in C, and it would be very useful to learn how to map them onto CUDA.
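To make the question concrete, is something along these lines the right direction? This is only my untested guess (out/d_out, the dummy body, and the block shape are placeholders):

__global__ void nestedLoopKernel(float *out, int MAXi, int MAXj)
{
    // One thread per (i, j) pair: x covers the inner loop, y covers the outer loop.
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    int i = threadIdx.y + blockDim.y * blockIdx.y;
    if (i < MAXi && j < MAXj) {
        // ...code that uses i and j...
        out[(size_t)i * MAXj + j] = (float)(i + j);   // dummy work
    }
}

// Host side (sketch), rounding the grid up so every (i, j) pair is covered.
dim3 block(256, 1);   // placeholder shape; could also be 32x8, etc.
dim3 grid((MAXj + block.x - 1) / block.x, (MAXi + block.y - 1) / block.y);
nestedLoopKernel<<<grid, block>>>(d_out, MAXi, MAXj);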