Optimal workgroup size for sum reduction in OpenCL

Question

I am using the following kernel for sum reduciton.

__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    for(unsigned int s = localSize >> 2; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}

It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.

I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?

This is completely hardware-dependant. What are you running your kernel on? — user703016, Oct 14 '14 at 14:27
@Cicada: That's the point. I want to test it on different devices: Intel HD Graphics 4400, and 2 x NVIDIA Tesla C2070 (Multi-GPU). So it would be better, if I could write a function, which calculates the optimal size dynamically. As far as I understand, the best approach so far is to test with different values and find the optimal one experimentally. — krisg, Oct 15 '14 at 08:37
The question reffers to "optimal" size. In my point of view, getting a fairly good value is easy, but getting an optimal one can be a hell. I would recommend to use some easy heuristic value that is supposed to give good values. Optimal can only be found via testing, in order to be 100% sure of it (due to many different HWs) — DarkZeros, Oct 15 '14 at 15:16

score 2 · Accepted Answer · edited May 23 '17 at 11:49

This question is a possible duplicate of one I answered a while back: What is the algorithm to determine optimal work group size and number of workgroup.

Experimentation will be the best way to know for sure for any given device.

Update: I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out the best values.

For each device:

1) query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

2) loop over a few multiples and run the kernel with that group size. save the execution time for each test.

3) when you think you have an optimal value, hard code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.

//define your own context, kernel, queue here

int err;
size_t global_size; //set this somewhere to match your test data size
size_t preferred_size;
size_t max_group_size;

err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), preferred_size, NULL);
//check err
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), max_group_size, NULL);
//check err

size_t test_size;

//your vars for hi-res timer go here

for (unsigned int i=preferred_size ; i<=max_group_size ; i+=preferred_size){
    //reset timer
    test_size = (size_t)i;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if(err){
        fail("Unable to enqueue kernel");  //implement your own fail function somewhere..
    }else{
        clfinish(queue);
        //stop timer, save value
        //output timer value and test_size
    }
}

The device-specific kernel can look like this, except the first line should have your optimal value substituted:

#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;
    __local float sdata[LOCAL_SIZE];
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);

    for(unsigned int s = LOCAL_SIZE >> 2; s > 0; s >>= 1){
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(tid == 0) output[get_group_id(0)] = sdata[0];
}

I know that it is similar, and your answer to that question is really helpfull. I thought that some of you have experience exactly with the reduction algorithm. It is mentioned in a variety of OpenCL books, but in most cases only the kernel code is shown. — krisg, Oct 15 '14 at 08:32
Updated answer with main portion of host code. Drop it into your implementation and make the necessary changes. — mfa, Oct 15 '14 at 14:37

Optimal workgroup size for sum reduction in OpenCL

1 Answers1

Linked