What is the optimum OpenCL 2 kernel to sum floats?

Question

C++ 17 introduced a number of new algorithms to support parallel execution, in particular std::reduce is a parallel version of std::accumulate which permits non-deterministic behaviour for non-commutative operations, such as floating point addition. I want to implement a reduce algorithm using OpenCL 2.

Intel have an example here which uses OpenCL 2 work group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is kernel to sum floats, based on Intel's exclusive_scan example:

kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }

  sum[0] = sum_val;
}

The kernel above works (or seems to!). However, exclusive_scan required the work_group_broadcast function to pass the last value of one work group to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val, so an atomic add is more appropriate.

OpenCL 2 provides an atomic_int which supports atomic_fetch_add. An integer version of the kernel above using atomic_int is:

kernel void sum_int (global int* sum, global int* values)
{
  atomic_int sum_val;
  atomic_init(&sum_val, 0);

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    int value = work_group_reduce_add(values[index]);
    atomic_fetch_add(&sum_val, value);
  }

  sum[0] = atomic_load(&sum_val);
}

OpenCL 2 also provides an atomic_float but it doesn't support atomic_fetch_add.

What is the best way to implement an OpenCL2 kernel to sum floats?

Isn't `sum[0] = atomic_load(&sum_val);` a data race condition? Did you mean adding in register in loop and then adding with atomic only once in the end? — huseyin tugrul buyukisik, Oct 21 '17 at 11:26
@huseyintugrulbuyukisik I don't know. The best description I can find about using OpenCL 2 atomics is [here](https://software.intel.com/en-us/articles/using-opencl-20-atomics). I couldn't find any example kernels using OpenCL 2 atomics, which is why I asked the question. If you have a correct implementation of the sum_int kernel (and ideally, the sum_float kernel too) then please provide it as an answer. — kenba, Oct 21 '17 at 12:05

huseyin tugrul buyukisik · Accepted Answer · 2017-10-21T12:45:11.040

kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }

  sum[0] = sum_val;
}

this has a race condition to write data to sum's zero-indexed element, all workgroups are doing same computation which makes this O(N*N) instead of O(N) and takes more than 1100 milliseconds to complete a 1M-element array sum.

For same 1-M element array, this(global=1M, local=256)

kernel void sum_float2 (global float* sum, global float* values)
{
      float sum_partial = work_group_reduce_add(values[get_global_id(0)]);
      if(get_local_id(0)==0)
        sum[get_group_id(0)] = sum_partial; 
}

followed by this (global=4k, local=256)

kernel void sum_float3 (global float* sum, global float* values)
{
  float sum_partial = work_group_reduce_add(sum[get_global_id(0)]);
  if(get_local_id(0)==0)
    values[get_group_id(0)] = sum_partial; 
}

does the same thing in a few miliseconds except a third step. First one gets each group sums into their group-id related item and second kernel sums those into 16 values and these 16 values can easily summed by CPU(microseconds or less)(as third step).

Program works like this:

values: 1.0 1.0 .... 1.0 1.0 
sum_float2
sum: 256.0 256.0 256.0
sum_float3
values: 65536.0 65536.0 .... 16 items total to be summed by cpu

if you need to use atomics, you should do it as sparsely as possible. Easiest example can be using local atomics to sum many values by each group and then doing last step using a single global atomic function per group to add all. I don't have a C++ setup ready for OpenCL for now, but I guess OpenCL 2.0 atomics are better when you are using multiple devices with same memory resource(probably streaming mode or in SVM) and/or a CPU using C++17 functions. If you don't have multiple devices computing on same area at same time, then I suppose that these new atomics can only be a micro-optimization on top of already working OpenCL 1.2 atomics. I didn't use these new atomics so take all these as a grain of salt.

Wow, that was quick! Your answer is a good solution, and appreciate your point about atomics. However, I would still like to see how to use OpenCL 2 atomics correctly, especially choosing between `atomic_compare_exchange_strong` and `atomic_compare_exchange_weak`. — kenba, Oct 21 '17 at 12:29
I didn't use them before. Let me fiddle with some kernel :) It may take a while, maybe tomorrow I can answer fully. Since I'm using C# directly, it may be impossible for me now. — huseyin tugrul buyukisik, Oct 21 '17 at 12:34
I've implemented just one of your kernels, swapping the input and output buffers in the CPU code where appropriate, it works like a charm. It's much better than using `atomics`! ;) — kenba, Oct 21 '17 at 15:50
Nice. I'll look for opencl 2.0 atomics later but I can do only in-kernel things for now, like atomics themselves but without trying if they work with C++17 lock guards or not. — huseyin tugrul buyukisik, Oct 21 '17 at 16:28

What is the optimum OpenCL 2 kernel to sum floats?

1 Answers1

Linked