I am using the following kernel for sum reduction.
__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load local (shared) memory: each work-item adds two adjacent input elements
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in local memory, starting at half the workgroup size
    // (assumes localSize is a power of two)
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write result for this workgroup to global mem
    if(tid == 0) output[bid] = sdata[0];
}
It works fine, but I don't know how to choose the optimal workgroup size or the number of workgroups when I need more than one workgroup (for example, to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
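To make the trade-off concrete, here is a rough host-side sketch in plain C (no OpenCL calls) of how I count the subresults and kernel passes for a given workgroup size. The helper names `groups_for` and `passes_for` are made up for illustration. It assumes every pass reuses the same workgroup size and that the input of each pass is zero-padded up to a multiple of twice the workgroup size, since each work-item in the kernel above reads two elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of workgroups (= subresults) for one pass.
   Each work-item reads two elements, so one workgroup consumes
   2 * local_size inputs. Rounds up, assuming zero-padding of the input. */
static size_t groups_for(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t per_group = 2 * local_size;
    return (n + per_group - 1) / per_group;
}

/* Hypothetical helper: number of kernel launches until a single value
   remains, feeding the subresults of each pass into the next one. */
static unsigned passes_for(size_t n, size_t local_size)
{
    unsigned passes = 0;
    while (n > 1) {
        n = groups_for(n, local_size);
        ++passes;
    }
    return passes;
}
```

Under these assumptions, for 1048576 elements a workgroup size of 64 leaves 8192 subresults after the first pass and 256 leaves 2048, and both need 3 passes in total, while 512 leaves 1024 subresults and finishes in 2 passes. This only counts launches; it says nothing about which size is actually fastest on a given device.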
I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?