
I know atomic functions with OpenCL 1.x are not recommended, but I just want to understand an atomic example.

The following kernel code does not work well: it produces random final values for the sum of all array values (a sum reduction):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Emulate an atomic add for double with a 64-bit compare-and-swap loop
void atom_add_double(volatile __local double *val, double delta)
{
  union {
    double f;
    ulong  i;
  } old, new;

  do
  {
    old.f = *val;
    new.f = old.f + delta;
  }
  while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);
}

__kernel void sumGPU(__global const double *input,
                     __local double *localInput,
                     __global double *finalSum)
{
  uint lid = get_local_id(0);
  uint gid = get_global_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);
  local double partialSum;
  local double finalSumTemp;

  // Initialize sums
  if (lid == 0)
  {
    partialSum = 0.0;
    finalSumTemp = 0.0;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  // Set in local memory
  int idx = groupid * localSize + lid;
  localInput[lid] = input[idx];

  // Compute atom_add into each workgroup
  barrier(CLK_LOCAL_MEM_FENCE);
  atom_add_double(&partialSum, localInput[lid]);
  // See and check if the barrier below is necessary
  barrier(CLK_LOCAL_MEM_FENCE);

  // Final sum of partialSums
  if (lid == 0)
  {
    atom_add_double(&finalSumTemp, partialSum);
    *finalSum = finalSumTemp;
  }
}

The version using a global-id strategy works well, but the version above, which uses local (shared) memory, doesn't give the expected results: the value of *finalSum is random on each execution.
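
For reference, the global-id version I mention looks roughly like this (a reconstruction, not my exact code; it assumes a __global variant of the compare-and-swap helper above):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Same compare-and-swap loop as atom_add_double above, but on __global memory
void atom_add_double_global(volatile __global double *val, double delta)
{
  union {
    double f;
    ulong  i;
  } old, new;

  do
  {
    old.f = *val;
    new.f = old.f + delta;
  }
  while (atom_cmpxchg((volatile __global ulong *)val, old.i, new.i) != old.i);
}

// Every work-item adds its own element straight into the global sum
__kernel void sumGPU_globalId(__global const double *input,
                              __global double *finalSum)
{
  atom_add_double_global(finalSum, input[get_global_id(0)]);
}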

Here are the buffers and kernel arguments in my host code:

 // Write to buffers
  ret = clEnqueueWriteBuffer(command_queue, inputBuffer, CL_TRUE, 0,
        nWorkItems * sizeof(double), xInput, 0, NULL, NULL);
  ret = clEnqueueWriteBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
                      sizeof(double), finalSumGPU, 0, NULL, NULL);

 // Set the arguments of the kernel
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
  clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&finalSumBuffer);

Finally, I read finalSumBuffer to get the sum value.
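
For completeness, the read-back is a blocking clEnqueueReadBuffer into a host double (a sketch; finalSumHost is a placeholder name):

  // Read the result back from the device (blocking read)
  double finalSumHost = 0.0;
  ret = clEnqueueReadBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
                            sizeof(double), &finalSumHost, 0, NULL, NULL);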

I think the issue comes from the kernel code, but I can't find where the error is.

If anyone can see what's wrong, it would be nice if you could tell me.

Thanks

UPDATE 1:

I have nearly managed to get this reduction working. Following the suggestions of huseyin tugrul buyukisik, I modified the kernel code like this:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Emulate an atomic add for double with a 64-bit compare-and-swap loop
void atom_add_double(volatile __local double *val, double delta)
{
  union {
    double d;
    ulong  i;
  } old, new;

  do
  {
    old.d = *val;
    new.d = old.d + delta;
  }
  while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);
}

__kernel void sumGPU(__global const double *input,
                     __local double *localInput,
                     __local double *partialSum,
                     __global double *finalSum)
{
  uint lid = get_local_id(0);
  uint gid = get_global_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);

  // Initialize partial sums
  if (lid == 0)
    partialSum[groupid] = 0.0;

  barrier(CLK_LOCAL_MEM_FENCE);

  // Set in local memory
  int idx = groupid * localSize + lid;
  localInput[lid] = input[idx];

  // Compute atom_add into each workgroup
  barrier(CLK_LOCAL_MEM_FENCE);
  atom_add_double(&partialSum[groupid], localInput[lid]);
  // See and check if the barrier below is necessary
  barrier(CLK_LOCAL_MEM_FENCE);

  // Compute final sum
  if (lid == 0)
    *finalSum += partialSum[groupid];
}

As huseyin said, I don't need atomic functions for the final sum of all partial sums.

So at the end I did:

// Compute final sum
  if (lid==0)
    *finalSum += partialSum[groupid]; 

But unfortunately, the final sum doesn't give the expected value, and the value is random (for example, with nWorkItems = 1024 and a workgroup size of 16, I get random values on the order of 1e+3 to 1e+4 instead of the expected 5.248e+05).

Here is the argument setup in the host code:

 // Set the arguments of the kernel
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
  clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
  clSetKernelArg(kernel, 2, nWorkGroups*sizeof(double), NULL);
  clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&finalSumBuffer);

Can you see where the error is in my kernel code?

Thanks

  • I forgot to say in my bounty comment that I would like to do this reduction with OpenCL 1.x atomic functions (not with OpenCL 2.x) –  Feb 06 '17 at 07:40

1 Answer


Not an error, but a logic issue:

atom_add_double(&finalSumTemp, partialSum);

runs only once per group (by the zero-local-indexed thread).

So you are effectively just doing

finalSumTemp = partialSum

so atomics are not needed here.


There is a race condition on

*finalSum = finalSumTemp;

between workgroups, where each zero-indexed local thread writes to the same address. So this should be an atomic addition (for learning purposes), or each group could write to a different cell that is then added on the host side, such as sum_group1 + sum_group2 + ... = total sum.
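
The second option could look like this (a sketch; partialSums is an assumed extra __global buffer with one cell per workgroup, and partialSum is the per-group local result):

// In the kernel: each group's zero-indexed thread writes to its own cell,
// so no two groups touch the same address and no atomics are needed.
if (lid == 0)
  partialSums[get_group_id(0)] = partialSum;

Then the host reads all nWorkGroups cells back and adds them:

// Host side, after reading the buffer into partialSumsHost:
double total = 0.0;
for (size_t g = 0; g < nWorkGroups; ++g)
  total += partialSumsHost[g];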


int idx = groupid * localSize + lid;
localInput[lid] = input[idx];

here, using groupid is suspicious for multi-device summation, because each device has its own global range and workgroup-id indexing, so two devices could have the same group-id values for two different groups. Some device-related offset should be used when multiple devices are involved, such as:

idx= get_global_id(0) + deviceOffset[deviceId];

Also, if the atomic operation is unavoidable and is performed exactly N times, it could be moved to a single thread (such as the 0-indexed thread) and looped N times in a second kernel, probably being faster, unless the atomic operation's latency can be hidden by other means.
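
Moving the final accumulation to a second kernel launched with a single work-item could look like this (a sketch; partialSums and numGroups are assumed to come from the first kernel):

__kernel void finalSumKernel(__global const double *partialSums,
                             __global double *finalSum,
                             uint numGroups)
{
  // One work-item loops over all partial sums: no atomics needed at all.
  double total = 0.0;
  for (uint g = 0; g < numGroups; ++g)
    total += partialSums[g];
  *finalSum = total;
}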

– huseyin tugrul buyukisik
  • Concerning your first remark, could you give me a valid kernel code, or a pseudo kernel code, that would work for the reduction of an array of doubles? Regards –  Feb 06 '17 at 23:49
  • You are adding to `finalSumTemp` only from local_id = 0, and that variable is different per group, so it doesn't need an atomic add. Locally adding local variables, then globally adding those partial sums, is okay for learning purposes, just as you do, but with the corrections. Did you test the corrections I wrote? – huseyin tugrul buyukisik Feb 06 '17 at 23:52
  • Thanks, I made the modifications in UPDATE 1 of my first post. I removed the final atomic add for the sum of all partial sums, but it doesn't seem to work. I have nearly managed to get this reduction working; it is frustrating. –  Feb 07 '17 at 11:26
  • `if (lid==0) *finalSum += partialSum[groupid];` needs to be atomic; maybe you misread? It has a race condition between groups. finalSumTemp is the non-atomic line. – huseyin tugrul buyukisik Feb 07 '17 at 11:30
  • Sorry, I think you didn't see the modifications to my kernel code under UPDATE 1: I don't use finalSumTemp anymore; I use partialSum[groupid] for the atomic add. I know there is no way to synchronize all the workgroups with each other, and that's the issue. If you look closer at my new code under UPDATE 1, maybe the problem is that I use the `volatile __local double` type in atom_add_double and not `__global double`. I chose `volatile __local` to benefit from the local-memory strategy instead of the classical global one. But actually, finalSum is declared as global. –  Feb 07 '17 at 12:21
  • finalSum is still not atomic as far as I can see; did you forget to change it to atomic? – huseyin tugrul buyukisik Feb 07 '17 at 12:24
  • I did it! My problem was that, initially, I didn't know how to declare finalSum as a global variable (for the atom_add_double of the final sum of partial sums) and, at the same time, perform the atom_add_double for the computation of each partial sum (which needs local variables). So the arguments for these two situations were not consistent. I worked around this by declaring two different atom_add_double functions: one called atom_add_double_local for the computation of the partial sums (with __local variables) and another called atom_add_global for the final sum (see the sketch after these comments). –  Feb 07 '17 at 12:34
  • Huseyin: do you see another solution instead of declaring two different atom_add_double functions? –  Feb 07 '17 at 12:35
  • Good, you have found a solution. I didn't say "use the exact same atomic add function", though. For learning purposes, local atomic + global atomic is okay. – huseyin tugrul buyukisik Feb 07 '17 at 12:41
  • Huseyin: your comment about performing the reduction "for multi-device summation" is very interesting. I didn't find the function "deviceOffset[deviceId]" on the web. Is it really called "deviceOffset"? If you have a link that talks about handling the global index with multiple devices, it would be kind of you to share it. Regards –  Feb 08 '17 at 20:28
  • @youpilat13 No, it's just another parameter you would create yourself if you use multiple devices. I just thought you might have multiple GPUs. – huseyin tugrul buyukisik Feb 09 '17 at 09:08
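
For reference, the __global helper described in the comments above could look like this. This is a sketch reconstructed from the discussion (the comments call it atom_add_global; the name atom_add_double_global and the exact signature are assumptions): it is the same compare-and-swap loop as atom_add_double, just with __global qualifiers.

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Hypothetical __global counterpart of atom_add_double (reconstruction)
void atom_add_double_global(volatile __global double *val, double delta)
{
  union {
    double d;
    ulong  i;
  } old, new;

  do
  {
    old.d = *val;
    new.d = old.d + delta;
  }
  while (atom_cmpxchg((volatile __global ulong *)val, old.i, new.i) != old.i);
}

// At the end of sumGPU, replacing the racy "*finalSum += partialSum[groupid];":
if (lid == 0)
  atom_add_double_global(finalSum, partialSum[groupid]);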