
First I should say I'm quite new to programming in C++ (let alone CUDA), though it was the first language I learned, about 184 years ago. I'd say I'm a bit out of touch with memory allocation and datatype sizes, though I'm learning. Anyway, here goes:

I have a GPU with compute capability 3.0 (it's a GeForce GTX 660 with 2 GB of DRAM).

Going by ./deviceQuery found in the CUDA samples (and by other charts I've found online), my maximum grid size is listed as:

Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)

At 2,147,483,647 (2^31-1) that x dimension is huge and kind of nice… YET, when I run my code, pushing beyond 65535 in the x dimension, things get... weird.

I used an example from an Udacity course, and modified it to test the extremes. I've kept the kernel code fairly simple to prove the point:

__global__ void referr(long int *d_out, long int *d_in){
  long int idx = blockIdx.x;
  d_out[idx] = idx;
}

Please note that ARRAY_SIZE below is both the size of the grid and the size of the array of integers on which to do operations. I am leaving the size of the blocks at 1x1x1. JUST for the sake of understanding the limitations: I KNOW that having this many operations with blocks of only 1 thread makes no sense, but I want to understand what's going on with the grid size limitations.

int main(int argc, char ** argv) {
  const long int ARRAY_SIZE = 522744;
  const long int ARRAY_BYTES = ARRAY_SIZE * sizeof(long int);

  // generate the input array on the host
  long int h_in[ARRAY_SIZE];
  for (long int i = 0; i < ARRAY_SIZE; i++) {
    h_in[i] = i;
  }
  long int h_out[ARRAY_SIZE];

  // declare GPU memory pointers
  long int *d_in;
  long int *d_out;

  // allocate GPU memory
  cudaMalloc((void**) &d_in, ARRAY_BYTES);
  cudaMalloc((void**) &d_out, ARRAY_BYTES);

  // transfer the array to the GPU
  cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

  // launch the kernel with ARRAY_SIZE blocks in the x dimension, with 1 thread each.
  referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);

  // copy back the result array to the CPU
  cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

  // print out the resulting array
  for (long int i = 0; i < ARRAY_SIZE; i++) {
    printf("%li", h_out[i]);
    printf(((i % 4) != 3) ? "\t" : "\n");
  }

  cudaFree(d_in);
  cudaFree(d_out);

  return 0;
}

This works as expected with an ARRAY_SIZE of at MOST 65535. The last few lines of the output are below:

65516   65517   65518   65519
65520   65521   65522   65523
65524   65525   65526   65527
65528   65529   65530   65531
65532   65533   65534

If I push the ARRAY_SIZE beyond this, the output gets really unpredictable, and eventually, if the number gets too high, I get a Segmentation fault (core dumped) message… whatever that even means. E.g. with an ARRAY_SIZE of 65536:

65520   65521   65522   65523
65524   65525   65526   65527
65528   65529   65530   65531
65532   65533   65534   131071

Why is it now stating that the blockIdx.x for this last one is 131071?? That is 65535+65535+1. Weird.

Even weirder, when I set the ARRAY_SIZE to 65537 (65535+2) I get some seriously strange results for the last lines of the output.

65520   65521   65522   65523
65524   65525   65526   65527
65528   65529   65530   65531
65532   65533   65534   131071
131072  131073  131074  131075
131076  131077  131078  131079
131080  131081  131082  131083
131084  131085  131086  131087
131088  131089  131090  131091
131092  131093  131094  131095
131096  131097  131098  131099
131100  131101  131102  131103
131104  131105  131106  131107
131108  131109  131110  131111
131112  131113  131114  131115
131116  131117  131118  131119
131120  131121  131122  131123
131124  131125  131126  131127
131128  131129  131130  131131
131132  131133  131134  131135
131136  131137  131138  131139
131140  131141  131142  131143
131144  131145  131146  131147
131148  131149  131150  131151
131152  131153  131154  131155
131156  131157  131158  131159
131160  131161  131162  131163
131164  131165  131166  131167
131168  131169  131170  131171
131172  131173  131174  131175
131176  131177  131178  131179
131180  131181  131182  131183
131184  131185  131186  131187
131188  131189  131190  131191
131192  131193  131194  131195
131196  131197  131198  131199
131200

Isn't 65535 the limit for older GPUs? Why is my GPU "messing up" when I push past the 65535 barrier for the x grid dimension? Or is this by design? What in the world is going on?

Wow, sorry for the long question.

Any help to understand this would be greatly appreciated! Thanks!

  • Are you compiling for the right compute capability? – Vitality Oct 03 '13 at 13:18
  • @JackOLantern No, as addressed by Robert Crovella. This let me push beyond 65535 blocks. But I'm still getting Segmentation fault (core dumped) messages for anything beyond a certain array size and have no way of knowing why! I will choose his answer if these are unrelated. – Arjun Mehta Oct 03 '13 at 16:14
  • I'd strongly recommend against structuring kernels with a dependency on large grid dimensions. Write grid-stride loops per http://cudahandbook.to/15QbFWx (a minimal sketch follows these comments). – ArchaeaSoftware Oct 03 '13 at 19:11
  • @ArchaeaSoftware Thanks for that! As I mentioned in the question, I understand that, but from a purely academic perspective, I needed to understand why I was limited to something well below what the compute capability reads. RobertCrovella answered this effectively! – Arjun Mehta Oct 07 '13 at 19:43
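
For illustration, here is a minimal sketch of the grid-stride approach ArchaeaSoftware recommends, applied to the referr kernel from the question (the referr_strided name and the fixed 256-block / 256-thread launch are example choices, not from the thread):

__global__ void referr_strided(long int *d_out, const long int *d_in, long int n) {
  // Each thread starts at its global index and strides by the total number of
  // threads in the grid, so it covers the whole array no matter how few blocks
  // were launched.
  for (long int idx = blockIdx.x * blockDim.x + threadIdx.x;
       idx < n;
       idx += (long int)blockDim.x * gridDim.x) {
    d_out[idx] = idx;
  }
}

// Example launch: a fixed, modest grid that is independent of ARRAY_SIZE.
// referr_strided<<<256, 256>>>(d_out, d_in, ARRAY_SIZE);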

1 Answer


You should be using proper CUDA error checking. And you should be compiling for a compute 3.0 architecture by specifying -arch=sm_30 when you compile with nvcc.
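
As a rough illustration of what such error checking might look like (the CUDA_CHECK macro name is just an example, not from the linked material):

#include <cstdio>
#include <cstdlib>

// Illustrative error-checking macro: print the CUDA error string and the
// location of the failing call, then abort.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
              cudaGetErrorString(err), __FILE__, __LINE__);           \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

// Usage in the question's main(), e.g.:
//   CUDA_CHECK(cudaMalloc((void**) &d_in, ARRAY_BYTES));
//   referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);
//   CUDA_CHECK(cudaGetLastError());      // catches invalid launch configurations
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during kernel execution

Compiling would then look something like nvcc -arch=sm_30 -o grid_test grid_test.cu (the file name is arbitrary).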

Robert Crovella
  • Thank you so much for this Robert! Using -arch=sm_30 while compiling let me increase the number of blocks to roughly 522,714. Nice! I made everything uniformly long ints as well, to make things simple. However, after 522,714 I still get a Segmentation fault (core dumped) message, even while wrapping all API calls in the error checking functions that are provided in your link. I think it may have to do with a memory copy, and the sheer size of the array. 522,714*4 = 2090856 which is VERY close to 2MB. Could this be a clue? – Arjun Mehta Oct 03 '13 at 15:57
  • PS. If you think this might not be a related issue, I would be happy to select this as the answer, as it did seem to address the core issue of pushing beyond 65535 blocks! Thanks! – Arjun Mehta Oct 03 '13 at 16:06
  • It's a separate issue. Seg faults originate from host code. In this case, you are running into problems with `h_in` and `h_out`. These types of allocations are stack allocations, and stack allocations are limited in size. Instead, allocate these variables from the system heap using host side `malloc` or similar (see the sketch after these comments). Yes, this is an unrelated issue. – Robert Crovella Oct 03 '13 at 16:12
  • Thank you so much for your help and for cluing me in to the notion of stack allocations :) – Arjun Mehta Oct 03 '13 at 16:23
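
A minimal sketch of the change Robert describes, assuming the same ARRAY_SIZE and ARRAY_BYTES definitions as the question's code: the fixed-size stack arrays h_in and h_out become heap allocations inside main().

// Heap-allocated host buffers instead of stack arrays, so their size is no
// longer limited by the process stack (commonly on the order of 8 MB).
// Requires <cstdlib> for malloc/free.
long int *h_in  = (long int *) malloc(ARRAY_BYTES);
long int *h_out = (long int *) malloc(ARRAY_BYTES);
if (h_in == NULL || h_out == NULL) {
  fprintf(stderr, "host allocation failed\n");
  return 1;
}

// ... same initialization, cudaMalloc, cudaMemcpy, and kernel launch as before ...

free(h_in);
free(h_out);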