TL;DR: This appears to be a bug in the emulation of the PTX instruction `mul.hi.s64` that is specific to `sm_5x` platforms, so filing a bug report with NVIDIA is the recommended course of action.
Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, on `sm_2x` and `sm_3x` platforms these are constructed from the machine code instruction `IMAD.U32`, a 32-bit integer multiply-add instruction.
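To illustrate the principle (this is a host-side C sketch of the schoolbook decomposition, not the actual SASS emulation sequence), a 64-bit product can be assembled from 32-bit multiply-add steps like so; the helper name `mul64_from_32` is made up for this example:

```c
#include <stdint.h>

/* Build the low 64 bits of a 64x64-bit product from 32-bit partial
   products, analogous in spirit to an IMAD.U32-based emulation. */
static uint64_t mul64_from_32 (uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t r = (uint64_t)a_lo * b_lo;       /* low partial product */
    r += ((uint64_t)a_lo * b_hi) << 32;       /* cross products only   */
    r += ((uint64_t)a_hi * b_lo) << 32;       /* affect the upper half */
    /* a_hi * b_hi lies entirely above bit 63, so it does not
       contribute to the low 64 bits of the result */
    return r;
}
```

The high half of the product (what `mul.hi.u64` returns) requires additional partial products and carry propagation, which is what makes the emulation sequence nontrivial.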
For the Maxwell architecture (that is, `sm_5x`), a high-throughput, but lower-width, integer multiply-add instruction `XMAD` was introduced, although a low-throughput legacy 32-bit integer multiply instruction `IMUL` was apparently retained. Inspection of the machine code generated for `sm_5x` by the CUDA 7.5 toolchain, disassembled with `cuobjdump --dump-sass`, shows that at `ptxas` optimization level `-O0` (which is used for debug builds) the 64-bit multiplies are emulated with the `IMUL` instruction, while at optimization level `-O1` and higher `XMAD` is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.
As it turns out, the `IMUL`-based emulation of `mul.hi.s64` for `sm_5x` is broken, while the `XMAD`-based emulation works fine. Therefore, one possible workaround is to use an optimization level of at least `-O1` for `ptxas`, by specifying `-Xptxas -O1` on the `nvcc` command line. Note that release builds use `-Xptxas -O3` by default, so no corrective action is necessary for release builds.
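Concretely, for a debug build the workaround amounts to adding one flag (the file names `app` and `app.cu` are placeholders):

```shell
# force ptxas to -O1 so the working XMAD-based emulation is selected
nvcc -arch=sm_50 -Xptxas -O1 -o app app.cu
```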
From code analysis, the emulation of `mul.hi.s64` is implemented as a wrapper around the emulation of `mul.hi.u64`, and this latter emulation seems to work fine on all platforms, including `sm_5x`. Thus another possible workaround is to use our own wrapper around `mul.hi.u64`. Coding with inline PTX is unnecessary in this case, since `mul.hi.s64` and `mul.hi.u64` are accessible via the device intrinsics `__mul64hi()` and `__umul64hi()`. As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.
```cuda
long long int m1, m2, result;
#if 0  // broken on sm_5x at ptxas optimization level -O0
    asm ("mul.hi.s64 %0, %1, %2;\n\t"
         : "=l"(result)
         : "l"(m1), "l"(m2));
#else
    // compute the unsigned high product, then adjust
    // for the signs of the two factors
    result = __umul64hi (m1, m2);
    if (m1 < 0LL) result -= m2;
    if (m2 < 0LL) result -= m1;
#endif
```
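The sign adjustment can be verified on the host. The sketch below substitutes a hypothetical `umul64hi()` helper built on GCC/Clang's `unsigned __int128` for the device intrinsic `__umul64hi()`, and checks the wrapper against a signed 128-bit reference computation; the adjustments are performed in unsigned arithmetic to sidestep signed-overflow issues:

```c
#include <stdint.h>

/* stand-in for the device intrinsic __umul64hi(): upper 64 bits of
   the full 128-bit unsigned product */
static uint64_t umul64hi (uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* signed high multiply, built as a wrapper around the unsigned one,
   using the same adjustments as the workaround above */
static int64_t mul64hi (int64_t a, int64_t b)
{
    uint64_t r = umul64hi ((uint64_t)a, (uint64_t)b);
    if (a < 0) r -= (uint64_t)b;   /* wraps modulo 2**64, as desired */
    if (b < 0) r -= (uint64_t)a;
    return (int64_t)r;
}

/* reference: upper 64 bits of the signed 128-bit product */
static int64_t mul64hi_ref (int64_t a, int64_t b)
{
    return (int64_t)(((__int128)a * b) >> 64);
}
```

The identity behind the adjustments: interpreting a negative `a` as the unsigned value `a + 2**64`, the unsigned high product exceeds the signed one by `b` (and symmetrically by `a` when `b` is negative), modulo 2**64, so subtracting the other factor for each negative operand recovers the signed result.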