
I am a newbie to CUDA programming. I want to perform vector addition on large arrays, each of size 1 million. Since I cannot create 1 million blocks in a 1D grid, I thought of making a 1000x1000 grid of blocks with 1 thread in each block.

Relevant main code:

//Copy to GPU
printf( "GPU\n" );
cudaMemcpy( dev_src1, src1, size * sizeof(int), cudaMemcpyHostToDevice );
cudaMemcpy( dev_src2, src2, size * sizeof(int), cudaMemcpyHostToDevice );

//Exec kernel
int nBlocks = ceil(sqrt(size));
int nThreadsPerBlock = 1;
addVector<<<dim3(nBlocks,nBlocks),nThreadsPerBlock>>>(dev_src1, dev_src2, dev_dest, size );

//Copy results to CPU
cudaMemcpy( dest, dev_dest, size * sizeof(int), cudaMemcpyDeviceToHost );

GPU kernel:

__global__ void addVector( int * src1, int * src2, int * dest, int size )
{
    int tid = blockIdx.y*blockDim.x + blockIdx.x;
    if( tid<size )
    dest[tid] = src1[tid] + src2[tid];
}

However, I do not get correct results after this. What could be my mistake, and how do I rectify it?

Here is what my results look like:

0: 0 + 0 = 0
1: 1 + 2 = 3
2: 2 + 4 = 6
3: 3 + 6 = 9
4: 4 + 8 = 12
5: 5 + 10 = 15
6: 6 + 12 = 18
7: 7 + 14 = 21
8: 8 + 16 = 24
9: 9 + 18 = 27
10: 10 + 20 = 266
11: 11 + 22 = 267
12: 12 + 24 = 268
13: 13 + 26 = 269
14: 14 + 28 = 270
15: 15 + 30 = 271
...

86: 86 + 172 = 342
87: 87 + 174 = 343
88: 88 + 176 = 344
89: 89 + 178 = 345
90: 90 + 180 = 346
91: 91 + 182 = 347
92: 92 + 184 = 348
93: 93 + 186 = 349
94: 94 + 188 = 350
95: 95 + 190 = 351
96: 96 + 192 = 352
97: 97 + 194 = 353
98: 98 + 196 = 354
99: 99 + 198 = 355
mkuse
  • why one thread per block? That wastes almost all the computational capacity of your GPU. – talonmies Jan 29 '13 at 10:10
  • ok, maybe my bad. Please can you suggest a better configuration for my case and how to achieve it. – mkuse Jan 29 '13 at 10:11
  • What is efficient and why you "do not get correct results" are two separate questions. Perhaps focus on the latter first. Can you edit your question to explain what results you do get, and include error checking in your code to confirm that the kernel is actually running at all? – talonmies Jan 29 '13 at 10:13
  • For error checking, you will probably want to read [this question and answer](http://stackoverflow.com/q/14038589/681865). – talonmies Jan 29 '13 at 10:15
  • I have confirmed that the kernel works. I can successfully run cuda-examples. – mkuse Jan 29 '13 at 10:16

1 Answer


In your current scenario, you are calculating tid incorrectly.

If there is 1 thread per block, then blockDim.x would be 1.

The tid should be calculated as:

int tid = blockIdx.y * gridDim.x + blockIdx.x;

However, I would not recommend creating 1 thread per block and a large number of blocks. It is plainly inefficient, as the occupancy of the kernel will be very low.

A recommended approach is to create blocks of a reasonable size, e.g. 128 or 256 threads per block, and then a grid large enough to cover your whole data.

For example:

int nThreadsPerBlock = 256;
int nBlocks = (size + nThreadsPerBlock - 1)/nThreadsPerBlock;

With this approach, the tid will be calculated as:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
sgarizvi
  • yes, I am aware that 1 thread per block is a bad choice. I am trying to get a grip on the usage of blockDim, threadIdx etc. Can you also point me to a resource that gives a lucid explanation of that? – mkuse Jan 29 '13 at 10:30
  • The best option is to thoroughly read the book [CUDA By Example](https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0). – sgarizvi Jan 29 '13 at 10:33
  • yes, I am reading the same. But went out of the way to explore stuff. Anyway, thanks for your help. – mkuse Jan 29 '13 at 10:36
  • @mkuse... Another very good resource is the Online Coursera Course of [Heterogeneous Parallel Programming](https://www.coursera.org/courses). Register for the course and download the lecture slides and video lectures. This course is extremely beneficial. – sgarizvi Jan 29 '13 at 10:40