Looking for examples for `atomic_fetch_add` for float32 in OpenCL 3.0

Question

It appears that OpenCL 3.0 has added support for the long-awaited atomic operations on floating-point numbers. However, after spending hours searching, I still can't find a single example showing how to use these functions.

I've already been using a common hack to achieve float32 atomic_add, but I wanted to try OpenCL 3's built-in support. I tried defining a macro to call atomic_fetch_add, like below:

#if __OPENCL_C_VERSION__ >= CL_VERSION_3_0
  #pragma OPENCL EXTENSION cl_ext_float_atomics : enable
  #define atomicadd(a,b) atomic_fetch_add((volatile atomic_float *)(a),(b)) 
#else
  inline float atomicadd(volatile __global float* address, const float value) {
    float old = value, orig;
    while ((old = atomic_xchg(address, (orig = atomic_xchg(address, 0.0f)) + old)) != 0.0f);
    return orig;
  }
#endif

but I am getting tons of errors:

<kernel>:320:26: warning: unknown OpenCL extension 'cl_ext_float_atomics' - ignoring
#pragma OPENCL EXTENSION cl_ext_float_atomics : enable
                         ^
<kernel>:773:17: error: no matching function for call to 'atomic_fetch_add'
                atomicadd(& field[*idx1d + tshift * gcfg->dimlen.z], -p[0].w);
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<kernel>:321:24: note: expanded from macro 'atomicadd'
#define atomicadd(a,b) atomic_fetch_add((volatile atomic_float *)(a),(b)) 
                       ^~~~~~~~~~~~~~~~
cl_kernel.h:4571:1: note: candidate function not viable: no known conversion from 'volatile atomic_float *' to 'volatile atomic_int *__attribute__((address_space(16776963)))' for 1st argument
DECL_ATOMIC_FETCH_MOD(atomic_int, int, int)
^
cl_kernel.h:4563:3: note: expanded from macro 'DECL_ATOMIC_FETCH_MOD'
  DECL_ATOMIC_FETCH_MOD_OP(add, A, C, M) \
  ^
...

where field[] is a float buffer in global memory. My computer has 2x GTX 2080 GPUs with driver 515.x, and clinfo reports that both devices support OpenCL 3.0.

What is the right way to call atomic_fetch_add with a float type?

It seems that Nvidia GPUs still only support the OpenCL C 1.2 language standard, as can be queried with `cl_device.getInfo<CL_DEVICE_OPENCL_C_VERSION>()`. – ProjectPhysX, Sep 24 '22 at 19:01
According to [this link](https://developer.nvidia.com/blog/nvidia-is-now-opencl-3-0-conformant/), NVIDIA driver 465 and newer added support for OpenCL 3.0. This is confirmed by my `clinfo` output: `Number of platforms: 2 Platform Name: NVIDIA CUDA, Platform Vendor: NVIDIA Corporation; Platform Version: OpenCL 3.0 CUDA 11.6.134` – FangQ, Sep 24 '22 at 23:32
`cl_ext_float_atomics` is an optional extension, and a vendor can claim conformance without providing that feature. – Björn Lindqvist, Sep 27 '22 at 12:46
Does this answer your question? [Atomic addition to floating point values in OpenCL for NVIDIA GPUs?](https://stackoverflow.com/questions/72044986/atomic-addition-to-floating-point-values-in-opencl-for-nvidia-gpus) – Björn Lindqvist, Sep 27 '22 at 12:47
@BjörnLindqvist, my code already has a dedicated [CUDA version](https://github.com/fangq/mcx), so the OpenCL version was really designed for general use (CPU, multi-vendor GPUs). Using PTX assembly won't solve the issue in a portable fashion. – FangQ, Sep 28 '22 at 15:26
After further reading on this, I realized that @ProjectPhysX's answer was correct. As also described in [this post](https://stackoverflow.com/a/67372358/4271392), while `CL_DEVICE_VERSION` reports OpenCL 3.0, `CL_DEVICE_OPENCL_C_VERSION` shows that the compiler only supports OpenCL C 1.2 on NVIDIA devices. My clinfo shows `Device OpenCL C Version: OpenCL C 1.2` on all my NVIDIA GPUs, including a recently acquired 3090. Shame on NVIDIA for not updating their OpenCL driver in order to keep their CUDA monopoly. – FangQ, Sep 28 '22 at 19:59

ProjectPhysX · Accepted Answer · 2022-09-29T15:42:22.280

Making my initial comment the answer here:

Nvidia GPUs still only support the OpenCL C 1.2 language standard, as can be queried with cl_device.getInfo<CL_DEVICE_OPENCL_C_VERSION>(). The platform version is reported as 3.0, but the features are unchanged from 1.2; in particular, the recent cl_ext_float_atomics extension is not yet supported.
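For anyone who wants to run this check from plain C rather than the C++ wrapper, here is a minimal host-side sketch. It assumes the two input strings were already fetched with `clGetDeviceInfo` using `CL_DEVICE_OPENCL_C_VERSION` and `CL_DEVICE_EXTENSIONS`; the helper names `parse_opencl_c_version` and `has_extension` are hypothetical, not part of any OpenCL API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn "OpenCL C <major>.<minor> <vendor info>" into
   major*100 + minor*10, the same encoding as __OPENCL_C_VERSION__
   (120 for 1.2, 300 for 3.0). Returns 0 if the string does not parse. */
static int parse_opencl_c_version(const char *s) {
    int major = 0, minor = 0;
    assert(s != NULL);
    if (sscanf(s, "OpenCL C %d.%d", &major, &minor) != 2)
        return 0;
    return major * 100 + minor * 10;
}

/* Hypothetical helper: whole-token lookup in the space-separated list
   returned for CL_DEVICE_EXTENSIONS, so that a longer name sharing the
   same prefix does not count as a match. */
static int has_extension(const char *list, const char *name) {
    size_t n;
    const char *p;
    assert(list != NULL && name != NULL);
    n = strlen(name);
    assert(n > 0);
    p = list;
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == list) || (p[-1] == ' ');
        int ends = (p[n] == '\0') || (p[n] == ' ');
        if (starts && ends)
            return 1;
        p += n; /* keep scanning after a partial match */
    }
    return 0;
}
```

With these, the built-in float atomics would only be attempted when `parse_opencl_c_version(...)` returns at least 300 and `has_extension(..., "cl_ext_float_atomics")` is true. As far as I know, the extension also exposes finer-grained capability bits through `CL_DEVICE_SINGLE_FP_ATOMIC_CAPABILITIES_EXT`, which a fuller check would query as well.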

In theory, you could switch in code between the usual atomics_add_f workaround and the inline PTX version, based on whether the device vendor is reported as "Nvidia" or whether some common nv_... extensions are available.

However, this is still not the elegant, universally compatible solution that cl_ext_float_atomics promises. It's a highly desired feature, and I hope the vendors will implement it soon.
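A rough sketch of such a switch on the host side, in plain C, might look like the following. Everything named here is an assumption for illustration: the enum, the two functions, and the `-DUSE_...` macro names are hypothetical, the vendor test is a simple case-sensitive substring match on "NVIDIA" (as in the `NVIDIA Corporation` string that clinfo prints), and the extension test is a plain substring match rather than a full token match. The returned string is meant to be passed as the build options to `clBuildProgram`, so the kernel source can pick its implementation with `#ifdef`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical set of float atomic-add implementations the kernel can use */
typedef enum {
    ATOMIC_BUILTIN,       /* atomic_fetch_add via cl_ext_float_atomics */
    ATOMIC_NV_PTX,        /* inline PTX, NVIDIA only */
    ATOMIC_XCHG_FALLBACK  /* portable atomic_xchg workaround */
} atomic_path;

/* Pick a path from the CL_DEVICE_VENDOR and CL_DEVICE_EXTENSIONS strings.
   The built-in extension wins when present; otherwise NVIDIA devices get
   the PTX path and everything else gets the portable fallback. */
static atomic_path choose_atomic_path(const char *vendor, const char *extensions) {
    assert(vendor != NULL && extensions != NULL);
    if (strstr(extensions, "cl_ext_float_atomics") != NULL)
        return ATOMIC_BUILTIN;
    if (strstr(vendor, "NVIDIA") != NULL)
        return ATOMIC_NV_PTX;
    return ATOMIC_XCHG_FALLBACK;
}

/* Map the chosen path to a preprocessor define for the kernel build.
   The macro names are made up for this sketch. */
static const char *build_option_for(atomic_path path) {
    switch (path) {
    case ATOMIC_BUILTIN: return "-DUSE_FLOAT_ATOMICS_EXT";
    case ATOMIC_NV_PTX:  return "-DUSE_NV_PTX_ATOMICS";
    default:             return "";
    }
}
```

Checking the extension first means the code automatically moves to the built-in atomics on any device that starts advertising them, without touching the vendor logic.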

edited Sep 29 '22 at 15:42

answered Sep 29 '22 at 04:35


Answer accepted. I want to add that while the `atomic_xchg` approach works, it is very inefficient (at least on NVIDIA GPUs). Profiling my similarly implemented [OpenCL](https://github.com/fangq/mcxcl) and [CUDA](https://github.com/fangq/mcx) codes shows that the atomic write accounts for nearly 50% of the run-time in the OpenCL implementation, compared to only 10% on CUDA (making OpenCL about 2x slower). So supporting a CUDA-like `atomic_add` for float via `cl_ext_float_atomics` is not just desirable for ease of implementation; it would also have a big impact on speed. – FangQ Sep 29 '22 at 13:06