
I'm new to OpenCL and have been trying to implement a three-level nested loop in a kernel function. I guess my understanding isn't sufficient yet. Below is the C code of the logic:

void scale(float *output, float *scales, int batch, int n, int size)
{
    int i,j,b;
    for(b = 0; b < batch; ++b){
        for(i = 0; i < n; ++i){
            for(j = 0; j < size; ++j){
                output[(b*n+i)*size+j] *= scales[i];
            }
        }
    }
}

where output and scales are 1D arrays. For example:

float output[18] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9};
float scales[9] = {1,0,1,0,1,0,1,0,1};

int n = 9;
int size = 2;
int batch = 1;

The expected output is:

1.000000  2.000000  0.000000  0.000000  5.000000  6.000000  
0.000000  0.000000  9.000000  1.000000  0.000000  0.000000 
4.000000  5.000000  0.000000  0.000000  8.000000  9.000000

Below is my OpenCL kernel

__kernel void scale_kernel(__global float *output, __global float *biases, int n, int size)
{
    int j = get_global_id(0);
    int i = get_group_id(1);
    int b = get_group_id(2);

    if(j < size) output[(b*n+i)*size + j] *= biases[i];
}

I believe this implementation is correct and that the way I'm launching the NDRange kernel is wrong. My BLOCK size is 16 (I think this is where my understanding fails).

size_t global_work_size[3] = {(size-1)/BLOCK + 1, n, batch};
size_t local_work_size[3] = {BLOCK, 1, 1};
cl.error = clEnqueueNDRangeKernel(queue, kernel, 3, 0, global_work_size, local_work_size, 0, 0, NULL);

EDIT 1:

Changing global_work_size as below produces the expected output; I've set local_work_size to NULL in this case. This might not give the best performance, though.

size_t global_work_size[3] = {size, n, batch};
cl.error = clEnqueueNDRangeKernel(queue, kernel, 3, 0, global_work_size, NULL, 0, 0, NULL);

Please let me know how to choose global_work_size and local_work_size.

Avis
  • That is not the best example for OpenCL: you achieve the best performance when the operations inside the for loop are INDEPENDENT of one another. In this case, the situation is not so easy to analyze: consecutive iterations may need the results of some previous iterations in order to execute the multiplication. This means that not all the operations can be executed in parallel and they need to be serialized. However, it is possible to solve the problem using "barriers" (i.e. https://stackoverflow.com/questions/6890302/barriers-in-opencl). – Leos313 Aug 06 '17 at 15:52
  • In addition, you are trying to use the results of some threads as an input to some other threads. I recommend you read about "matrix multiplication" and "reduction" as similar case studies. – Leos313 Aug 06 '17 at 15:58
  • Looks like image processing; maybe you can fix everything with 2D or even 1D kernel launches and compute the i,j values inside the kernel. If this is image processing, you should process each "batch" as a separate kernel and compute them independently (their addresses don't look like they overlap, so you should try that) to gain performance by hiding latencies. I guess j is the scan line, n is the height, and batch is the number of pictures. – huseyin tugrul buyukisik Aug 06 '17 at 21:13
  • Leos313, huseyin: Thank you very much for your time. The mistake I made was in the x dimension of global_work_size and local_work_size. I changed it to {size, n, batch} and set local_work_size to NULL so that OpenCL itself will pick a value. Now my outputs match. – Avis Aug 07 '17 at 03:04
  • I might not get the best performance with the change mentioned in my comment, so I will leave the question open for anyone to suggest how to optimize the global and local sizes for the best performance. – Avis Aug 07 '17 at 03:06
