Large for loop in Cuda kernel doesn't work for large arrays

Question

I've implemented various algorithms using Cuda, such as matrix multiplication, Cholesky decomposition and inversion (by forward substitution) of a lower triangular matrix.

For some of these algorithms I have a for loop in the kernel that repeats part of the kernel code lots of times. It all works well for (flattened: represented by 1D arrays) matrices (of floats) up to about 200x200, with the for loop calling part of the kernel code 200 times. Increasing the matrix size to say 1000x1000 (with the for loop calling part of the kernel code 1000 times) leaves the GPU to take as much computing time as can be expected based on trials with smaller matrix sizes. But no kernel code (including parts outside the for loop) seems to have been run (the output matrix has none of its elements changed since initialization). If I increase the matrix size to around 500 I'm sometimes able to get the kernel to run if I set the limiter in the for loop to some low value (such has 3).

Have I hit some hardware limit here or is there a trick I can use to make these for loops work for large matrices?

This is an example of complete code that you can copy into a .cu file. The kernel attempts to copy the contents of matrix A (W*H) to matrix B (W*H). The output shows the first element of both matrices, for W*H < 200x200 this works just fine, for W*H = 1000x1000 no copying seems to occur because the elements of B remain zero, as if nothing happened since initialization. I'm compiling and running this code on a linux based server. For large matrices error checking gives me: "GPUassert: unspecified launch failure" at line 67 which is the cudamempcy line that copies matrix B from device to host.

  #include <cuda.h>
  #include <cuda_runtime.h>
  #include <cuda_runtime_api.h>
  #include <stdio.h> 
  #include <stdlib.h>
  #include <math.h>
  #include <iostream>
  #include <time.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}  

__global__ void MatrixCopy(float *A, float *B, int W)
{

int i = blockIdx.x*blockDim.x + threadIdx.x;
int j = blockIdx.y*blockDim.y + threadIdx.y;

B[j*W + i]=A[j*W + i];

}

int main(void)
{

clock_t start1=clock();

int W=1000;
int H=1000;
float *A, *B;
float *devA, *devB;

A=(float*)malloc(W*H*sizeof(float));
B=(float*)malloc(W*H*sizeof(float));

for(int i=0; i<=W*H; i++)
{
    A[i]=rand() % 3;
    A[i]=A[i]+1;
    B[i]=0;
}

gpuErrchk( cudaMalloc( (void**)&devA, W*H*sizeof(float) ) ); 
gpuErrchk( cudaMalloc( (void**)&devB, W*H*sizeof(float) ) ); 

gpuErrchk( cudaMemcpy( devA, A, W*H*sizeof(float), cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMemcpy( devB, B, W*H*sizeof(float), cudaMemcpyHostToDevice ) );

dim3 threads(32,32);
int bloW=(int)ceil((double)W/32);
int bloH=(int)ceil((double)H/32);
dim3 blocks(bloW, bloH);

clock_t finish1=clock();
clock_t start2=clock();

MatrixCopy<<<blocks,threads>>>(devA, devB, W);
gpuErrchk( cudaPeekAtLastError() );

gpuErrchk( cudaMemcpy( B, devB, W*H*sizeof(float), cudaMemcpyDeviceToHost ) );

clock_t finish2=clock();

printf("\nGPU calculation time (ms): %d\nInitialization time (ms): %d\n\n", (int)ceil(double(((finish2-start2)*1000/(CLOCKS_PER_SEC)))), (int)ceil(double(((finish1-start1)*1000/(CLOCKS_PER_SEC)))));
printf("\n%f\n", A[0]);
printf("\n%f\n\n", B[0]);

gpuErrchk( cudaFree(devA) );
gpuErrchk( cudaFree(devB) );

free(A);
free(B);

#ifdef _WIN32 
    system ("PAUSE"); 
#endif 

return 0;

}

You need to provide more detailed information about your problem as well as source code. Above information is insufficient. — Farzad, Jan 16 '14 at 07:00
Maybe you have the same problem like [here](http://stackoverflow.com/questions/6185117/cudamemcpy-errorthe-launch-timed-out-and-was-terminated). Your kernel just takes to long time to compute and is terminated by the system. — hubs, Jan 16 '14 at 07:24
Even kernels that take only 40 miliseconds to complete have this problem. The only thing that gets rid of it is reducing the matrix size. Just found out I don't even need the large for loop to get this behavior, even without such a loop the same thing happens if I make the matrix large enough (1000x1000) and it only takes 40 miliseconds. — user3085127, Jan 16 '14 at 07:37
The best would be to provide a minimum sized code reproducing your problem with full [CUDA error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) as well as the hardware you are using (you are concerned with hardware limits) and the compilation string. — Vitality, Jan 16 '14 at 08:21
The compilation string is simply "nvcc filename.cu -o filename". The GPU is a Tesla M2090. I use 32x32 threads per block. — user3085127, Jan 16 '14 at 15:34
SO [expects](http://stackoverflow.com/help/on-topic): "1.Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See http://SSCCE.org for guidance." You have not provided a valid code that reproduces the problem. It's likely that you are not doing proper error checking or have some other issue not related to the small snippet of code you have posted. Provide a *short, complete* reproducer of the problem. Something I can copy, paste, compile and run without having to add or change anything. — Robert Crovella, Jan 16 '14 at 16:11
Tesla M2090 has compute capability 2.0, while it seems that you are compiling for the default compute capability 1.0, see [What are the default values for arch and code options when using nvcc?](http://stackoverflow.com/questions/4671607/what-are-the-default-values-for-arch-and-code-options-when-using-nvcc). I'm voting to close the question since the OP seems not willing to produce any further information. — Vitality, Jan 16 '14 at 16:31
You should post your code not as an answer, but by editing your original question. Your code as posted has no [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) but is generating numerous internal errors. To see some of these errors, run your code with `cuda-memcheck`. Add proper cuda error checking to your code. — Robert Crovella, Jan 16 '14 at 17:02
I've included error checking now, for large matrices I get "GPUassert: unspecified launch failure" at line 67 which is the cudamempcy line that copies matrix B from device to host. — user3085127, Jan 16 '14 at 17:20

score 0 · Accepted Answer · edited May 23 '17 at 11:49

0

Your kernel has no thread checking.

You are deciding the grid size (in blocks) like this:

int bloW=(int)ceil((double)W/32);
int bloH=(int)ceil((double)H/32);

For values of H and W that are not even multiples of the threads per block sizes (32) this creates extra threads and blocks, outside of the actual matrix you care about (1000x1000). There's nothing wrong with this; this is common practice.

However, we must make sure those extra threads don't actually do anything (i.e. don't generate invalid accesses to memory). Your kernel does not provide this checking.

If you modify your kernel to be something like this:

__global__ void MatrixCopy(float *A, float *B, int W, int H)
{

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;

  if ((i <  W) && (j < H))
    B[j*W + i]=A[j*W + i];

}

I think you'll have better results. Without this, some of your A and B references in the kernel are generating out-of-bounds accesses, which you can see if your run your code with cuda-memcheck. And you'll have to modify the kernel invocation line to add the H parameter as well. I haven't really sorted out whether your i variable corresponds to H or W; I assume you can do that and make the change if needed. In this case, since the matrix is square, it doesn't really matter.

And you should do proper cuda error checking any time you are having trouble with CUDA code. I would suggest doing this before you post here asking for help.

edited May 23 '17 at 11:49

Community

1
1

answered Jan 16 '14 at 17:13

Robert Crovella

143,785
11
213
257

I'm really a beginner at this (I first touched Cuda only a month or two ago and I'm not working with it on a daily basis, that's why I had some trouble getting error checking to work properly). Putting bounds on the indices using an if statement within the kernel seems to solve the problem for this script and another one I wrote that does matrix multiplication. I though if statements were to be avoided in Cuda kernels but I didn't notice a performance hit with the single if statement that places bounds on indices. Thank you very much! – user3085127 Jan 16 '14 at 17:30
Adding -arch=sm_20 to the compilation string turns out not to be necessary, all that's required is that if statement. I knew it had to be something really simple. – user3085127 Jan 16 '14 at 17:36
Yes, it's true that `-arch=sm_20` is not necessary in this case. You are relying on a JIT-compile operation that happens at runtime in order to migrate over the architectural difference between sm_10 (the default, which only supports 512 threads per block) and sm_20, which supports 1024 (and corresponds to the GPU you are actually running on.) – Robert Crovella Jan 16 '14 at 17:44
Yes, but would it be in general advisable to compile for the right architecture? For example, if the OP had `double`s instead of `float`s, would they be demoted or would the JIT preserve the precision? The same question for atomic operations? – Vitality Jan 16 '14 at 17:50
Yes, it's better to compile for the correct architecture. I simply wanted to explain OP's observation. – Robert Crovella Jan 16 '14 at 18:05
-arch=sm_20 is often not necessary but sometimes it is, noted! I've updated all my Cuda scripts and including proper indice bounding if statements makes them all work. I'm listing this thread as solved. Thank you all for your help! – user3085127 Jan 16 '14 at 18:37
That's not how we mark things as solved here on SO. If you like an answer, upvote it (click the triangle above the vote number in the left hand column). If you feel that an answer properly answers your question or solves your issue, accept it (click the checkmark below the vote number). – Robert Crovella Jan 16 '14 at 18:41

Large for loop in Cuda kernel doesn't work for large arrays

1 Answers1