C++17 introduced a number of new algorithms to support parallel execution. In particular, std::reduce is a parallel version of std::accumulate that permits non-deterministic behaviour for non-associative or non-commutative operations. Floating-point addition is one such case: it is commutative but not associative, so the result depends on the order in which the terms are grouped. I want to implement a reduce algorithm using OpenCL 2.
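To make the ordering issue concrete, here is a minimal C++ sketch. It uses std::accumulate, which always sums left to right, on two orderings of the same three floats. The helper name sum_in_order is hypothetical, and the values are chosen on the assumption of IEEE 754 single precision with no excess intermediate precision.

```cpp
#include <numeric>
#include <vector>

// Sums the elements strictly left to right, as std::accumulate is specified
// to do. std::reduce is allowed to regroup and reorder the same additions.
inline float sum_in_order(const std::vector<float>& values) {
    return std::accumulate(values.begin(), values.end(), 0.0f);
}
```

Under those assumptions, sum_in_order({1e8f, -1e8f, 1.0f}) gives 1.0f, while sum_in_order({1e8f, 1.0f, -1e8f}) gives 0.0f, because 1e8f + 1.0f rounds back to 1e8f. A parallel reduce may produce either grouping.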
Intel have an example here which uses the OpenCL 2 work-group functions to implement a std::exclusive_scan kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan example:
kernel void sum_float (global float* sum, global float* values)
{
    float sum_val = 0.0f;
    for (size_t i = 0u; i < get_num_groups(0); ++i)
    {
        size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
        float value = work_group_reduce_add(values[index]);
        sum_val += work_group_broadcast(value, 0u);
    }
    sum[0] = sum_val;
}
The kernel above works (or seems to!). However, exclusive_scan required the work_group_broadcast function to pass the last value of one work group to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val, so an atomic add is more appropriate.
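One way to check that a float kernel like this really works is to compare its output against a host-side reference. The C++ sketch below is only an assumption about how such a check might look: chunked_sum and nearly_equal are hypothetical helpers, and the relative tolerance is an arbitrary choice. It sums one chunk of local_size elements at a time and then adds the chunk totals, mirroring the loop in the kernel. Because the order of additions inside work_group_reduce_add is up to the implementation, the comparison uses a tolerance rather than exact equality.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sums `values` chunk by chunk (each chunk holds up to `local_size` elements),
// then adds the chunk totals. Assumes local_size > 0.
inline float chunked_sum(const std::vector<float>& values, std::size_t local_size) {
    float total = 0.0f;
    for (std::size_t start = 0; start < values.size(); start += local_size) {
        const std::size_t end = std::min(start + local_size, values.size());
        float chunk = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            chunk += values[i];
        }
        total += chunk;
    }
    return total;
}

// True when `a` and `b` agree to within a relative tolerance.
inline bool nearly_equal(float a, float b, float rel_tol = 1e-5f) {
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

The value read back from sum[0] could then be checked with nearly_equal(device_sum, chunked_sum(host_values, local_size)), where device_sum, host_values, and local_size stand for whatever the host program uses.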
OpenCL 2 provides an atomic_int which supports atomic_fetch_add. An integer version of the kernel above using atomic_int is:
kernel void sum_int (global int* sum, global int* values)
{
    atomic_int sum_val;
    atomic_init(&sum_val, 0);
    for (size_t i = 0u; i < get_num_groups(0); ++i)
    {
        size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
        int value = work_group_reduce_add(values[index]);
        atomic_fetch_add(&sum_val, value);
    }
    sum[0] = atomic_load(&sum_val);
}
OpenCL 2 also provides an atomic_float, but it doesn't support atomic_fetch_add.
What is the best way to implement an OpenCL 2 kernel to sum floats?
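For comparison, the usual workaround in C++ for an atomic float type that lacks fetch-add is a compare-and-swap loop. The sketch below uses std::atomic<float>, and atomic_add_float is a hypothetical helper name. Whether the same pattern carries over to OpenCL 2's atomic_float through atomic_compare_exchange_weak is an assumption that would need checking against the OpenCL C specification and the target device.

```cpp
#include <atomic>

// Atomically adds `value` to `target` with a compare-and-swap loop and
// returns the value that was stored before the addition.
inline float atomic_add_float(std::atomic<float>& target, float value) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the current
    // contents of `target`, so the sum is recomputed on each retry.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}
```

Even with an atomic add, the order in which the partial sums arrive is not fixed, so the float result can differ from run to run in the way std::reduce permits.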