I just managed to install the CUDA SDK under Linux Ubuntu 10.04. My graphics card is an NVIDIA GeForce GT 425M, and I'd like to use it for some heavy computational problem. What I wonder is: is there any way to use unsigned 128-bit integer variables? When using gcc to run my program on the CPU, I was using the __uint128_t type, but using it with CUDA doesn't seem to work. Is there anything I can do to have 128-bit integers in CUDA?
4 Answers
For best performance, one would want to map the 128-bit type on top of a suitable CUDA vector type, such as uint4, and implement the functionality using PTX inline assembly. The addition would look something like this:
typedef uint4 my_uint128_t;

__device__ my_uint128_t add_uint128 (my_uint128_t addend, my_uint128_t augend)
{
    my_uint128_t res;
    // 128-bit addition: four 32-bit adds chained through the carry flag
    asm ("add.cc.u32  %0, %4,  %8;\n\t"
         "addc.cc.u32 %1, %5,  %9;\n\t"
         "addc.cc.u32 %2, %6, %10;\n\t"
         "addc.u32    %3, %7, %11;\n\t"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(addend.x), "r"(addend.y), "r"(addend.z), "r"(addend.w),
           "r"(augend.x), "r"(augend.y), "r"(augend.z), "r"(augend.w));
    return res;
}
The multiplication can similarly be constructed using PTX inline assembly by breaking the 128-bit numbers into 32-bit chunks, computing the 64-bit partial products and adding them appropriately. Obviously this takes a bit of work. One might get reasonable performance at the C level by breaking the number into 64-bit chunks and using __umul64hi() in conjunction with regular 64-bit multiplication and some additions. This would result in the following:
__device__ my_uint128_t mul_uint128 (my_uint128_t multiplicand,
                                     my_uint128_t multiplier)
{
    my_uint128_t res;
    unsigned long long ahi, alo, bhi, blo, phi, plo;
    // assemble the 64-bit halves from the 32-bit components
    alo = ((unsigned long long)multiplicand.y << 32) | multiplicand.x;
    ahi = ((unsigned long long)multiplicand.w << 32) | multiplicand.z;
    blo = ((unsigned long long)multiplier.y << 32) | multiplier.x;
    bhi = ((unsigned long long)multiplier.w << 32) | multiplier.z;
    // 128-bit product modulo 2^128: low 64 bits, then high 64 bits
    plo = alo * blo;
    phi = __umul64hi (alo, blo) + alo * bhi + ahi * blo;
    // split the result back into 32-bit components
    res.x = (unsigned int)(plo & 0xffffffff);
    res.y = (unsigned int)(plo >> 32);
    res.z = (unsigned int)(phi & 0xffffffff);
    res.w = (unsigned int)(phi >> 32);
    return res;
}
Below is a version of the 128-bit multiplication that uses PTX inline assembly. It requires PTX 3.0, which shipped with CUDA 4.2, and the code requires a GPU with at least compute capability 2.0, i.e. a Fermi or Kepler class device. The code uses the minimal number of instructions, as sixteen 32-bit multiplies are needed to implement a 128-bit multiplication. By comparison, the variant above using CUDA intrinsics compiles to 23 instructions for an sm_20 target.
__device__ my_uint128_t mul_uint128 (my_uint128_t a, my_uint128_t b)
{
    my_uint128_t res;
    asm ("{\n\t"
         "mul.lo.u32     %0, %4,  %8;    \n\t"
         "mul.hi.u32     %1, %4,  %8;    \n\t"
         "mad.lo.cc.u32  %1, %4,  %9, %1;\n\t"
         "madc.hi.u32    %2, %4,  %9,  0;\n\t"
         "mad.lo.cc.u32  %1, %5,  %8, %1;\n\t"
         "madc.hi.cc.u32 %2, %5,  %8, %2;\n\t"
         "madc.hi.u32    %3, %4, %10,  0;\n\t"
         "mad.lo.cc.u32  %2, %4, %10, %2;\n\t"
         "madc.hi.u32    %3, %5,  %9, %3;\n\t"
         "mad.lo.cc.u32  %2, %5,  %9, %2;\n\t"
         "madc.hi.u32    %3, %6,  %8, %3;\n\t"
         "mad.lo.cc.u32  %2, %6,  %8, %2;\n\t"
         "madc.lo.u32    %3, %4, %11, %3;\n\t"
         "mad.lo.u32     %3, %5, %10, %3;\n\t"
         "mad.lo.u32     %3, %6,  %9, %3;\n\t"
         "mad.lo.u32     %3, %7,  %8, %3;\n\t"
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return res;
}
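For illustration, here is a minimal sketch (not part of the original answer) of how these device functions might be used from a kernel; the kernel name and data layout are assumptions:

__global__ void fma_uint128_kernel (const my_uint128_t *a, const my_uint128_t *b,
                                    const my_uint128_t *c, my_uint128_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // out[i] = a[i] * b[i] + c[i], all computed modulo 2^128
        out[i] = add_uint128 (mul_uint128 (a[i], b[i]), c[i]);
    }
}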

- @njuffa - I assume today you would suggest a solution based on 2 64-bit values? – einpoklum May 30 '18 at 12:56
- @einpoklum Unlikely, since 64-bit integer operations are emulated and it is usually best to build emulations on top of native instructions rather than other emulations. Because 32-bit integer multiply and multiply-add are themselves emulated on Maxwell and Pascal architectures, it would possibly be best to use native *16-bit* multiplies there, which map to the machine instruction `XMAD` (a 16x16+32 bit multiply-add operation). I *read* that native 32-bit integer multiplies were restored with the Volta architecture, but I have no hands-on experience with Volta yet. – njuffa May 30 '18 at 15:41
- How is performance compared to 32-bit integers? 1/16 or similar? – huseyin tugrul buyukisik Jun 07 '18 at 10:31
- @huseyintugrulbuyukisik Based on instruction count it would be *around* 1/16 of a native 32-bit multiplication. The actual performance impact could vary a bit depending on code context, based on the loading of functional units and register usage. – njuffa Jun 07 '18 at 14:17
- Can we also do uint128 adds atomically? – proteneer Jan 22 '20 at 19:37
- @proteneer Best I know, the GPU hardware only supports atomic operations up to a size of 64 bits. I have not researched whether one could construct atomics for larger types via clever software constructs. – njuffa Jan 22 '20 at 20:59
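To follow up on the atomics question: the hardware offers no 128-bit atomics, but one software workaround (not from the original discussion, and only a sketch) is to guard the 128-bit value with a lock built from the native 32-bit atomics. This assumes a device with independent thread scheduling (compute capability 7.0 or newer); on older GPUs, threads of the same warp spinning on the lock can deadlock. The lock variable and function name are illustrative:

__device__ int lock128 = 0;   // illustrative lock guarding a single 128-bit counter

__device__ void atomic_add_uint128 (my_uint128_t *addr, my_uint128_t val)
{
    while (atomicCAS (&lock128, 0, 1) != 0) { }   // spin until the lock is acquired
    __threadfence ();                             // make updates by previous lock holders visible
    *addr = add_uint128 (*addr, val);             // plain 128-bit add inside the critical section
    __threadfence ();                             // publish the update before releasing
    atomicExch (&lock128, 0);                     // release the lock
}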
CUDA doesn't support 128-bit integers natively, but you can fake the operations yourself using two 64-bit integers:
typedef struct {
    unsigned long long int lo;
    unsigned long long int hi;
} my_uint128;

my_uint128 add_uint128 (my_uint128 a, my_uint128 b)
{
    my_uint128 res;
    res.lo = a.lo + b.lo;
    res.hi = a.hi + b.hi + (res.lo < a.lo);   // carry out of the low half
    return res;
}
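Subtraction can be emulated along the same lines; here is a small sketch (not part of the original answer) where the comparison supplies the borrow out of the low half:

my_uint128 sub_uint128 (my_uint128 a, my_uint128 b)
{
    my_uint128 res;
    res.lo = a.lo - b.lo;
    res.hi = a.hi - b.hi - (a.lo < b.lo);   // borrow out of the low half
    return res;
}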

- Thank you very much! Just one more question: from an efficiency point of view, is this going to be fast enough? – Matteo Monti May 28 '11 at 18:59
- I tested that code on my CPU. It actually works, but it's 6 times slower than using the __uint128_t type... isn't there any way to make it faster? – Matteo Monti May 28 '11 at 22:04
- You tested built-in 128-bit integers against this `my_uint128`, both on the CPU? Of course the native support will be faster. The hope is that performance on the GPU with this 128-bit type will be faster than performance on the CPU with built-in 128-bit integers. – tkerwin May 28 '11 at 22:52
A much-belated answer, but you could consider using this library:
https://github.com/curtisseizert/CUDA-uint128
which defines a 128-bit-sized structure, with methods and freestanding utility functions that allow it to be used much like a regular integer. Mostly.

- This is really cool, and a much better answer than the others :) After looking at the source code, I saw that there's a __mul64hi PTX instruction that makes 64 x 64 bit multiplication efficient. – Adam Ritter Mar 27 '19 at 00:46
For posterity, note that as of CUDA 11.5, nvcc supports __int128_t in device code when the host compiler supports it (e.g., clang/gcc, but not MSVC). CUDA 11.6 added support for debug tools with __int128_t.
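As a small illustrative sketch (assuming CUDA 11.5 or newer and a gcc or clang host compiler; the kernel is hypothetical), __uint128_t can then be used directly in device code:

#include <cstdio>

__global__ void mul_demo (unsigned long long a, unsigned long long b)
{
    __uint128_t p = (__uint128_t)a * b;   // full 128-bit product
    // device-side printf has no 128-bit conversion, so print the two 64-bit halves
    printf ("hi = %llu  lo = %llu\n",
            (unsigned long long)(p >> 64), (unsigned long long)p);
}

int main (void)
{
    mul_demo<<<1,1>>>(0xffffffffffffffffULL, 3ULL);
    cudaDeviceSynchronize ();
    return 0;
}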
