0

From this question and this question I managed to compile a minimal example of summing a vector into a single double inside OpenCL 1.2.

    /* https://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html */
    inline void AtomicAdd(volatile __global double *source, const double operand) {
      union { unsigned int intVal; double floatVal; } prevVal, newVal;
      do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
      } while( atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal );
    }

    void kernel cost_function(__constant double* inputs, __global double* outputs){
      int index = get_global_id(0);

      if(0 == error_index){ outputs[0] = 0.0; }
      barrier(CLK_GLOBAL_MEM_FENCE);

      AtomicAdd(&outputs[0], inputs[index]); /* (1) */
      //AtomicAdd(&outputs[0], 5.0); /* (2) */

    }

As in fact this solution is incorrect because the result is always 0 when the buffer is accessed. What might the problem with this?

the code at /* (1) */ doesn't work, and neither does the code at /* (2) */, which is only there to test the logic independent of any inputs.

Is barrier(CLK_GLOBAL_MEM_FENCE); used correctly here to reset the output before any calculations are done to it?

According to the specs in OpenCL 1.2 single precision floating point numbers are supported by atomic operations, is this(AtomicAdd) a feasible method of extending the support to double precision numbers or am I missing something?

Of course the device I am testing with supports cl_khr_fp64˙of course.

Dávid Tóth
  • 2,788
  • 1
  • 21
  • 46

1 Answers1

2

Your AtomicAdd is incorrect. Namely, the 2 errors are:

  1. In the union, intVal must be a 64-bit integer and not 32-bit integer.
  2. Use the 64-bit atom_cmpxchg function and not the 32-bit atomic_cmpxchg function.

The correct implementation is:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
inline void AtomicAdd(volatile __global double *source, const double operand) {
    union { unsigned ulong u64; double f64; } prevVal, newVal;
    do {
        prevVal.f64 = *source;
        newVal.f64 = prevVal.f64 + operand;
    } while(atom_cmpxchg((volatile __global ulong*)source, prevVal.u64, newVal.u64) != prevVal.u64);
}

barrier(CLK_GLOBAL_MEM_FENCE); is used correctly here. Note that a barrier must not be in an if- or else-branch.

UPDATE: According to STREAMHPC, the original implementation you use is not guaranteed to produce correct results. There is an improved implementation:

void __attribute__((always_inline)) atomic_add_f(volatile global float* addr, const float val) {
    union {
        uint  u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr;
    do {
        next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
        current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
    } while(current.u32!=expected.u32);
}

#ifdef cl_khr_int64_base_atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
void __attribute__((always_inline)) atomic_add_d(volatile global double* addr, const double val) {
    union {
        ulong  u64;
        double f64;
    } next, expected, current;
    current.f64 = *addr;
    do {
        next.f64 = (expected.f64=current.f64)+val; // ...*val for atomic_mul_d()
        current.u64 = atom_cmpxchg((volatile global ulong*)addr, expected.u64, next.u64);
    } while(current.u64!=expected.u64);
}
#endif
ProjectPhysX
  • 4,535
  • 2
  • 14
  • 34
  • 1
    Awesome! Meanwhile I figured it out as well! Although it only works if the cast is to `volatile __global ulong`. Thank you for your answer! – Dávid Tóth Jan 23 '22 at 12:43
  • 1
    and the following needs to be set: `#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable` If you extend your answer with this information I will acept it; and thank you! – Dávid Tóth Jan 23 '22 at 12:45
  • 1
    Correct! I missed those 2 things. – ProjectPhysX Jan 23 '22 at 12:50
  • 1
    @David Tóth please see my uptated answer again: There is an improved implementation of the algorithm. – ProjectPhysX Jan 30 '22 at 11:12
  • According to the post `atomic_cmpxchg` needs to be used, but that was used originally as well... I think I am missing the logical changes between the two versions of kernel code; To be more precise, I don't see what is guaranteeing that the result will not be. According to the specs, the only difference is that `atom_cmpxchg` is for 64 bit numbers, and `atomic_cmpxchg` is for 32 bits.. – Dávid Tóth Feb 01 '22 at 07:05
  • 1
    AH! Nevermind I think I got it: the return value of the function is not correct, but the memory load of the current is. – Dávid Tóth Feb 01 '22 at 07:21