I think I have found a problem when doing 128-bit signed multiplication in CUDA PTX. Here is my sample code:

long long result_lo, result_hi;
asm(" mul.lo.s64 %0, 0, -1;     \n\t" // 0 * -1 = 0
    " mul.hi.s64 %1, 0, -1;     \n\t"
    : "=l"(result_lo), "=l"(result_hi));

This should produce result_lo = 0x0 and result_hi = 0x0. However, it produces result_lo = 0x0 and result_hi = 0xFFFFFFFFFFFFFFFF, which, read together as a 128-bit two's-complement value, is -2^64 and clearly not zero.

First, I want to make sure my understanding is correct; more importantly, is there a way around this?

Update: Changing from Debug mode to Release mode fixes this issue. I am still wondering whether this is a bug in CUDA.

Update 2: Reported this bug to NVIDIA.

Used CUDA toolkit 7.5 with Visual Studio 2013, x64 Debug, sm_52, compute_52.

Dane Bouchie
  • It may be a bug in debug mode on `sm_52`. I can reproduce it on `sm_52` in debug mode but not on `sm_35` or `sm_20`. Note that Maxwell devices don't have native 64-bit integer multiply operations; the compiler produces sequences of 32-bit ops. Therefore [this](http://stackoverflow.com/questions/6162140/128-bit-integer-on-cuda) may be of interest. – Robert Crovella Feb 13 '16 at 23:17
  • @RobertCrovella Since sm_5x (as opposed to sm_2x, sm_3x) does not even have a 32-bit integer multiplier in hardware, the emulation sequences for 64-bit integer multiplication on sm_5x will necessarily differ from the emulation sequences used for sm_2x, sm_3x. A bug report specific to sm_5x seems in order. – njuffa Feb 13 '16 at 23:45
  • Yes, I've filed a bug already. – Robert Crovella Feb 14 '16 at 00:00
  • As for a workaround, you would want to make sure that the PTXAS optimization level is at least `-O1`. At PTXAS optimization level `-O0` (which is used for debug builds), a bad emulation sequence is being emitted for `sm_5x` targets (FWIW, this emulation sequence does not look like anything I would expect). So try building with `-Xptxas -O1` or higher. – njuffa Feb 14 '16 at 00:59
  • @njuffa: That would be a perfect answer, if you care to add it. – talonmies Feb 14 '16 at 04:47
  • A fix can be observed in [the 361.28 driver](http://www.nvidia.com/download/driverResults.aspx/98373/en-us). It's an issue with the conversion from PTX to SASS, as @njuffa has already indicated in his answer. Therefore, simply loading the 361.28 driver and compiling as usual with `nvcc -G -arch=sm_52 ...` will not produce any change in behavior. However, the JIT mechanism in 361.28 contains the fix, so if you compile without a specific arch flag, e.g. `nvcc -G ...`, the driver JIT will allow the fix to be observed. Future drivers and the next major CUDA toolkit release should also pick up the fix. – Robert Crovella Feb 15 '16 at 20:26

1 Answer

TL;DR This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.

Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.
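
To illustrate what such an emulation sequence has to compute, here is a minimal host-side C++ sketch of the high half of an unsigned 64-bit multiply built from four 32-bit partial products. The function name umul64hi_model is hypothetical, and this schoolbook decomposition is only an assumption about the general shape of the arithmetic; it is not the instruction sequence that ptxas actually emits.

```cpp
#include <cstdint>

// Hypothetical model (not from the original post): high 64 bits of the
// unsigned 128-bit product a * b, using only 32x32->64-bit multiplies.
uint64_t umul64hi_model(uint64_t a, uint64_t b) {
    // Split each operand into 32-bit halves, widened to 64 bits.
    uint64_t a_lo = static_cast<uint32_t>(a), a_hi = a >> 32;
    uint64_t b_lo = static_cast<uint32_t>(b), b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;  // weight 2^0
    uint64_t p1 = a_lo * b_hi;  // weight 2^32
    uint64_t p2 = a_hi * b_lo;  // weight 2^32
    uint64_t p3 = a_hi * b_hi;  // weight 2^64

    // Everything that lands in bits 32..63 of the full product. The sum of
    // three values below 2^32 cannot overflow 64 bits; its upper half is the
    // carry into bit 64.
    uint64_t mid = (p0 >> 32) + static_cast<uint32_t>(p1)
                              + static_cast<uint32_t>(p2);

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

For example, umul64hi_model(0xFFFFFFFFFFFFFFFFULL, 2) yields 1, because (2^64 - 1) * 2 = 2^65 - 2. The low half needs no such treatment, since an ordinary 64-bit multiply already wraps to the low 64 bits.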

For the Maxwell architecture (that is, sm_5x), a high-throughput, but lower-width, integer multiply-add instruction XMAD was introduced, although a low-throughput legacy 32-bit integer multiply IMUL was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dumpsass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.

As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.

From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.

    long long int m1, m2, result;
#if 0 // broken on sm_5x at optimization level -O0
    asm(" mul.hi.s64 %0, %1, %2;     \n\t"
        : "=l"(result)
        : "l"(m1), "l"(m2));
#else
    result = __umul64hi (m1, m2);
    if (m1 < 0LL) result -= m2;
    if (m2 < 0LL) result -= m1;
#endif
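
Why the two subtractions suffice: a negative 64-bit operand m, reinterpreted as unsigned, equals m + 2^64. The unsigned product therefore exceeds the signed one by 2^64 * m2 when m1 is negative and by 2^64 * m1 when m2 is negative (plus a multiple of 2^128 that falls outside the result), and both extras land exactly in the high 64 bits. The following host-side C++ sketch can be used to check this against a direct 128-bit reference. All function names are hypothetical, umul64hi_ref merely stands in for the device intrinsic __umul64hi(), and the code assumes the __int128 extension of GCC and Clang with arithmetic right shift of negative values.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for the device intrinsic __umul64hi().
// Assumption: the compiler provides unsigned __int128 (GCC/Clang extension).
static uint64_t umul64hi_ref(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}

// Signed high half derived from the unsigned high half, applying the same
// two corrections as the workaround above. The subtractions are done on
// unsigned values so that wrap-around is well defined.
int64_t mul64hi_via_unsigned(int64_t m1, int64_t m2) {
    uint64_t hi = umul64hi_ref(static_cast<uint64_t>(m1),
                               static_cast<uint64_t>(m2));
    if (m1 < 0) hi -= static_cast<uint64_t>(m2);
    if (m2 < 0) hi -= static_cast<uint64_t>(m1);
    return static_cast<int64_t>(hi);
}

// Direct signed reference. Assumption: >> on a negative __int128 is an
// arithmetic shift, as it is with GCC and Clang.
int64_t mul64hi_ref(int64_t m1, int64_t m2) {
    return static_cast<int64_t>((static_cast<__int128>(m1) * m2) >> 64);
}
```

With the operands from the question, mul64hi_via_unsigned(0, -1) returns 0, which is the result the broken -O0 emulation fails to deliver.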
njuffa