
In CUDA, to cover multiple blocks and thus increase the range of indices available for arrays, we do something like this:

Host-side code:

 dim3 dimGrid(9, 1);   // 9 blocks are launched in total
 dim3 dimBlock(16, 1); // each block has 16 threads, so the grid
                       // contains 16 x 9 = 144 threads in total

Device-side code:

 ...
 ...     
 idx = blockIdx.x * blockDim.x + threadIdx.x; // idx ranges from 0 to 143
 a[idx] = a[idx] * a[idx];
 ...
 ...    

What is the equivalent in OpenCL for achieving the above?

gpuguy
  • You may want to change the device-side code. idx will range from 0 to 11 with the configuration you launched. To get a range of 0 to 143, you may want to use idx = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x; – Pavan Yalamanchili May 01 '12 at 04:07
  • Oops, that was a mistake. I have edited the block dimensions to be one-dimensional. – gpuguy May 01 '12 at 04:42
  • So I did it this way: Host side: localWorkSize[0] = 16; globalWorkSize[0] = 9 * 16; Device side: int idx = get_global_id(0); a[idx] = a[idx] * a[idx]; ... ... – gpuguy May 01 '12 at 08:10

2 Answers


On the host, when you enqueue your kernel using clEnqueueNDRangeKernel, you have to specify the global and local work size. For instance:

size_t global_work_size[1] = { 144 }; // 16 * 9 == 144
size_t local_work_size[1] = { 16 };
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                       global_work_size, local_work_size,
                       0, NULL, NULL);

In your kernel, use:

size_t get_global_size(uint dim);
size_t get_global_id(uint dim);
size_t get_local_size(uint dim);
size_t get_local_id(uint dim);

to retrieve the global and local work sizes and indices respectively, where dim is 0 for x, 1 for y and 2 for z.

The equivalent of your idx will thus be simply size_t idx = get_global_id(0);
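To make the mapping concrete, here is a minimal host-side sketch in plain C that emulates what a one-dimensional NDRange does with the sizes above. It is not OpenCL API code: `square_item` and `emulate_ndrange_1d` are hypothetical helper names used only for illustration, and the loops run sequentially, whereas a real device schedules work-items in parallel. The inner variable `idx` takes the values that `get_global_id(0)` would return when no global offset is passed.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by a single work-item; idx stands in for get_global_id(0). */
void square_item(float *a, size_t idx)
{
    a[idx] = a[idx] * a[idx];
}

/* Hypothetical helper: walks every (work-group, local id) pair in order
 * and rebuilds the global index the same way CUDA does with
 * blockIdx.x * blockDim.x + threadIdx.x. */
void emulate_ndrange_1d(float *a, size_t global_size, size_t local_size)
{
    /* This sketch assumes the global size is a multiple of the local size. */
    assert(local_size > 0 && global_size % local_size == 0);

    size_t num_groups = global_size / local_size;       /* 144 / 16 = 9 */
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            size_t idx = group * local_size + local;    /* 0 .. global_size - 1 */
            square_item(a, idx);
        }
    }
}
```

Calling `emulate_ndrange_1d(a, 144, 16)` on a 144-element array squares every element exactly once, which is what the kernel does when launched with the global and local sizes shown earlier.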

See the OpenCL Reference Pages.

Simon

Equivalences between CUDA and OpenCL (x dimension, no global offset) are:

blockIdx.x * blockDim.x + threadIdx.x = get_global_id(0)

LocalSize = blockDim.x = get_local_size(0)

GlobalSize = blockDim.x * gridDim.x = get_global_size(0)
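As a rough illustration of these equivalences on the host side, the sketch below converts a one-dimensional CUDA launch configuration into the two sizes passed to clEnqueueNDRangeKernel. The `NDRange1D` struct and the `ndrange_from_cuda_launch` function are hypothetical names invented for this example; they are not part of either API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two 1D sizes OpenCL expects. */
typedef struct {
    size_t global_work_size; /* total work-items: blockDim.x * gridDim.x */
    size_t local_work_size;  /* work-items per work-group: blockDim.x */
} NDRange1D;

/* Translate a CUDA-style launch (number of blocks, threads per block)
 * into OpenCL global and local work sizes. */
NDRange1D ndrange_from_cuda_launch(size_t grid_dim_x, size_t block_dim_x)
{
    assert(block_dim_x > 0);

    NDRange1D r;
    r.local_work_size = block_dim_x;
    r.global_work_size = grid_dim_x * block_dim_x;
    return r;
}
```

For the launch in the question, `ndrange_from_cuda_launch(9, 16)` gives a global size of 144 and a local size of 16. Because the global size is built as a product, it is always a multiple of the local size, which OpenCL 1.x requires when an explicit local size is given.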