The `atomicMin` function defined by CUDA doesn't support floating-point quantities. According to the documentation, the only available prototypes are for `int`, `unsigned int`, and `unsigned long long int` (the last requiring compiling for, and running on, a GPU of compute capability 3.5 or higher).
There are at least two options:

1. You could refactor your code to replace atomics with a classical parallel reduction.
2. As indicated in the programming guide, "arbitrary" atomics can be created using `atomicCAS` (compare-and-swap) plus a loop of some sort.

Here's one possible realization of the `atomicCAS` approach, for `double`:

    __device__ double atomicMin_double(double* address, double val)
    {
        // atomicCAS has no double overload, so work on the 64-bit pattern.
        unsigned long long int* address_as_ull = (unsigned long long int*) address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // The swap succeeds only if *address still holds `assumed`.
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(fmin(val, __longlong_as_double(assumed))));
        } while (assumed != old);  // integer comparison, so NaN cannot hang the loop
        return __longlong_as_double(old);
    }

This related question and its answers may also be of interest, although they primarily have `float` in view.
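
For the reduction option mentioned above, here is a minimal sketch of what a classical shared-memory min reduction might look like. The names `blockMin`, `deviceMin`, and `kThreads`, the block size of 256, and the cap of 64 blocks are all assumptions chosen for illustration, and error checking is omitted for brevity. It uses `DBL_MAX` as the identity element, so it assumes the data does not rely on `+INF` being reported.

```cuda
#include <algorithm>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // hypothetical block size; must be a power of two

// Each block reduces a grid-strided slice of `in` to one partial minimum.
__global__ void blockMin(const double* in, double* partial, int n)
{
    __shared__ double s[kThreads];

    // Per-thread minimum over a grid-stride loop, kept in a register.
    double v = DBL_MAX;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v = fmin(v, in[i]);
    s[threadIdx.x] = v;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = fmin(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: returns the minimum of `h`, or DBL_MAX if empty.
double deviceMin(const std::vector<double>& h)
{
    if (h.empty()) return DBL_MAX;
    const int n = static_cast<int>(h.size());
    const int blocks = std::min(64, (n + kThreads - 1) / kThreads);

    double *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(double));
    cudaMalloc(&dPartial, blocks * sizeof(double));
    cudaMemcpy(dIn, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    blockMin<<<blocks, kThreads>>>(dIn, dPartial, n);

    // Finish the reduction of the few per-block results on the host.
    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    return *std::min_element(partial.begin(), partial.end());
}
```

No atomics are needed here: each block writes its own slot of `partial`, and the final pass over at most 64 values is cheap enough to do on the host.
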
A few other comments:
By switching to `float` instead of `double`, I believe it is possible to simplify the `atomicMin` (or `atomicMax`) operation as indicated in the answer I linked to above, probably with a few caveats (e.g. no NaN or INF data). I believe that IEEE 754 `float` follows an ordering rule for two non-negative quantities `A` and `B` such that if `A > B`, then `*reinterpret_cast<int*>(&A) > *reinterpret_cast<int*>(&B)`. For negative values the integer ordering is reversed, because the format is sign-magnitude, so negative inputs need separate handling. I'm not sure whether `double` follows a similar rule with `long long`, although it uses the same sign/exponent/mantissa layout (probably someone else will be able to say).
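
The ordering argument above can be turned into a `float` atomic min built on the integer atomics. What follows is a minimal sketch under the assumption of NaN-free data, not necessarily the exact code from the linked answer; `atomicMinFloat`, `minKernelFloat`, and `deviceMinFloat` are hypothetical names. It branches on the sign bit rather than on `value >= 0` so that `-0.0f` takes the negative path.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Float atomic min via integer atomics. Assumes no NaN values are involved.
// Returns the value that was stored at `addr` before the operation.
__device__ float atomicMinFloat(float* addr, float value)
{
    if (__float_as_int(value) >= 0) {
        // Sign bit clear: non-negative floats sort like signed ints, and any
        // negative float already stored has a negative int pattern, so it wins.
        return __int_as_float(
            atomicMin(reinterpret_cast<int*>(addr), __float_as_int(value)));
    }
    // Sign bit set: a larger unsigned pattern means a more negative float,
    // and every non-negative float has a smaller unsigned pattern.
    return __uint_as_float(
        atomicMax(reinterpret_cast<unsigned int*>(addr), __float_as_uint(value)));
}

// One atomic per element; kept deliberately simple for illustration.
__global__ void minKernelFloat(const float* in, float* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMinFloat(result, in[i]);
}

// Hypothetical host helper: returns the minimum of `h`, or FLT_MAX if empty.
float deviceMinFloat(const std::vector<float>& h)
{
    if (h.empty()) return FLT_MAX;
    const int n = static_cast<int>(h.size());
    float init = FLT_MAX, out = 0.0f;

    float *dIn = nullptr, *dResult = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, &init, sizeof(float), cudaMemcpyHostToDevice);

    minKernelFloat<<<(n + 255) / 256, 256>>>(dIn, dResult, n);

    cudaMemcpy(&out, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dResult);
    return out;
}
```

Unlike the `atomicCAS` loop, this issues exactly one hardware atomic per call, which is the simplification the `float` route offers.
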

In your code, this loop could operate on a local quantity first, then do a single atomic operation at the end, like so:

    double v = *g;
    for (int u = 0; u < 16; u++)
    {
        v = min(v, d);
    }
    atomicMin_double(g, v);

which I think should be faster, since each thread then issues a single atomic rather than one per loop iteration.
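
To make the "local quantity first, one atomic at the end" pattern concrete, here is a minimal self-contained sketch. The kernel `minKernelDouble`, the host helper `deviceMinAtomic`, and the launch configuration of 32 blocks of 128 threads are assumptions for illustration, not taken from your code; the CAS-based helper is repeated so the example stands alone, and error checking is omitted.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// CAS-based double atomic min, repeated here so the sketch is self-contained.
__device__ double atomicMin_double(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*) address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(fmin(val, __longlong_as_double(assumed))));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// Each thread folds its elements into a register, then issues ONE atomic.
__global__ void minKernelDouble(const double* in, double* g, int n)
{
    double v = DBL_MAX;  // local accumulator
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v = fmin(v, in[i]);
    atomicMin_double(g, v);
}

// Hypothetical host helper: returns the minimum of `h`, or DBL_MAX if empty.
double deviceMinAtomic(const std::vector<double>& h)
{
    if (h.empty()) return DBL_MAX;
    const int n = static_cast<int>(h.size());
    double init = DBL_MAX, out = 0.0;

    double *dIn = nullptr, *dG = nullptr;
    cudaMalloc(&dIn, n * sizeof(double));
    cudaMalloc(&dG, sizeof(double));
    cudaMemcpy(dIn, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dG, &init, sizeof(double), cudaMemcpyHostToDevice);

    minKernelDouble<<<32, 128>>>(dIn, dG, n);

    cudaMemcpy(&out, dG, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dG);
    return out;
}
```

With this layout the number of atomics equals the number of launched threads (4096 here) regardless of `n`, so contention on `g` stays bounded as the input grows.
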