
I recently bought a GTX 550 Ti Boost card. Programs that used to work on my old GF440 card now fail. Here is an example: the following program works fine with smaller launch configurations, but goes wrong with larger ones.

#include <stdio.h>

__global__ void kernel(float * d_in, float * d_out){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = x + y * blockDim.x * gridDim.x;
    d_out[idx] = d_in[idx];
}


int main(){
    const dim3 gridSize(10,10);
    const dim3 blockSize(80,80);
    const int size = 800*800;
    float * h_in  = new float[size];
    float * h_out = new float[size];
    float * d_in;
    float * d_out;
    cudaMalloc((void**)&d_in, sizeof(float)*size);
    cudaMalloc((void**)&d_out, sizeof(float)*size);
    for(int i = 0; i < size; i++)
        h_in[i] = (float)i;

    cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);
    kernel<<<gridSize,blockSize>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(float)*size, cudaMemcpyDeviceToHost);

    for(int i = 0; i < size; i++)
        printf("%f\n",h_out[i]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

I expected it to output the indices as floats, but it outputs seemingly random floats:

0.131061
2.520029
9.304665
0.000189
0.242134
0.525557
0.560013


However, when I switch to size 100*100:

const dim3 gridSize(10,10);
const dim3 blockSize(10,10);
const int size = 100*100;

it works fine (last 5 outputs):

9995.000000
9996.000000
9997.000000
9998.000000
9999.000000


But for the larger size 500*500:

const dim3 gridSize(10,10);
const dim3 blockSize(50,50);
const int size = 500*500;

it outputs the wrong indices (last 5 outputs):

512139.000000
512140.000000
512141.000000
512142.000000
512143.000000

I have CUDA 5.5 installed. Thanks!

user2684645

1 Answer


Whenever you are having trouble with CUDA code, you should be doing proper CUDA error checking.
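As a sketch of what that could look like (the macro name `cudaCheck` is my own, not from any library), every runtime API call can be wrapped so that errors abort with file and line information, and the kernel launch itself is queried explicitly afterwards:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro: wraps a CUDA runtime call and aborts
// with file/line info if the call returned an error.
#define cudaCheck(call)                                              \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage around a kernel launch: the launch statement itself returns
// no status, so it must be checked separately.
//
//   cudaCheck(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
//   kernel<<<gridSize, blockSize>>>(d_in, d_out);
//   cudaCheck(cudaGetLastError());       // catches invalid launch config
//   cudaCheck(cudaDeviceSynchronize());  // catches errors during execution
```

With `blockSize(80,80)`, the `cudaGetLastError()` check after the launch would have reported the failure instead of letting the program print uninitialized memory.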

This is not valid:

    const dim3 blockSize(80,80);

This is asking for a threadblock of 80*80 = 6400 threads. There are no GPUs that support 6400 threads per threadblock.

This is also not valid:

const dim3 blockSize(50,50);

2500 threads is also too many. These configs would not work on either of your cards.

This is acceptable:

const dim3 blockSize(10,10);

In the "not valid" cases, your kernel is not running at all. If you had done proper CUDA error checking, you would have discovered this and even gotten a clue as to what might be wrong (invalid launch configuration).
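For the 800*800 case, one way to stay under the 1024-threads-per-block limit is to keep the block small and grow the grid instead. A sketch of a replacement configuration for the question's `main()` (the specific 16x16 shape is just one valid choice):

```cuda
// 800*800 elements with 16*16 = 256 threads per block (well under the
// 1024-per-block limit) and a 50x50 grid of blocks:
const dim3 blockSize(16, 16);             // 256 threads per block
const dim3 gridSize(800 / 16, 800 / 16);  // 50 x 50 blocks
kernel<<<gridSize, blockSize>>>(d_in, d_out);
```

The kernel's index computation (`x + y * blockDim.x * gridDim.x`) still covers all 640000 elements with this shape, since `blockDim.x * gridDim.x == 800`.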

You may also want to familiarize yourself with the deviceQuery CUDA sample, and study its output for your GPUs.

Robert Crovella
  • Thanks, 1024 is the block size limit and also the per-dimension limit! So what is the maximum number of threads per grid? – user2684645 Aug 15 '13 at 04:00
  • That is not specified as far as I know. It's a large number, whatever it is. – Robert Crovella Aug 15 '13 at 04:10
  • Maximum number of threads per multiprocessor: 2048. Max dimension size of a grid: (2147483647, 65535, 65535). And I have 4 multiprocessors. What is the maximum number of threads I can launch in a kernel? – user2684645 Aug 15 '13 at 04:15
  • Say I want to launch a kernel that copies 2^30 number of elements. How can this be done? Thanks! – user2684645 Aug 15 '13 at 04:18
  • Post a new question please. You can certainly create a kernel with that many threads, each thread copying one element, if you wanted to. The grid dimensions are in *blocks*, each block can be up to 1024 threads. If you do the math, you'll discover what I mean by a *large number*. It's certainly not the "largest" possible grid, but what is 1024*2147483647? That would be the largest number of threads in a 1-D threadblock/grid structure for a cc3.x device. I won't be answering any further new questions in these comments. – Robert Crovella Aug 15 '13 at 04:30
  • Ok I posted a new question – user2684645 Aug 15 '13 at 04:37