
I was just reading:

Efficiently dividing unsigned value by a power of two, rounding up

and I was wondering what the fastest way to do this in CUDA is. Of course, by "fast" I mean in terms of throughput (the linked question also addresses the case where subsequent calls depend on each other).

For the lg() function mentioned in that question (base-2 logarithm of divisor), suppose we have:

template <typename T> __device__ int find_first_set(T x);
template <> __device__ int find_first_set<uint32_t>(uint32_t x) { return __ffs(x);   }
template <> __device__ int find_first_set<uint64_t>(uint64_t x) { return __ffsll(x); }

template <typename T> __device__ int lg(T x) { return find_first_set(x) - 1; }

Edit: Since I've been made aware that there's no find-first-set instruction in PTX, nor in the instruction set of any NVIDIA GPU to date, let's replace that lg() with the following:

template <typename T> __device__ int population_count(T x);
template <> __device__ int population_count<uint32_t>(uint32_t x) { return __popc(x);   }
template <> __device__ int population_count<uint64_t>(uint64_t x) { return __popcll(x); }

template <typename T>
__device__ int lg_for_power_of_2(T x) { return population_count(x - 1); }

and we now need to implement

template <typename T> T div_by_power_of_2_rounding_up(T p, T q);

... for T = uint32_t and T = uint64_t. (p is the dividend, q is the divisor).
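
For clarity, the plain semantics we're after can be written naively as follows; this baseline is only here to pin down the expected behavior, it's obviously not the fast version being asked for:

// Naive baseline, only to define the expected results.
// q is assumed to be a nonzero power of two; p may be 0 or close to the maximum.
template <typename T>
__device__ T div_by_power_of_2_rounding_up_naive(T p, T q)
{
    return p / q + (p % q != 0);
}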

Notes:

  • As in the original question, we may not assume that p <= std::numeric_limits<T>::max() - q or that p > 0 - that would collapse the various interesting alternatives :-)
  • 0 is not a power of 2, so we may assume q != 0.
  • I realize solutions might differ for 32-bit and 64-bit; I'm more interested in the former but also in the latter.
  • Let's focus on Maxwell and Pascal chips.
einpoklum
  • Is an extra pre-condition allowed? It can be written `(dividend + mask) >> log_2_of_divisor` *if* that doesn't cause wrapping – harold Apr 22 '17 at 21:32
  • @harold: No; like in the original question, that's not allowed, see edit. – einpoklum Apr 22 '17 at 21:36

6 Answers


With funnel shifting available, a possible 32-bit strategy is to do (essentially) a 33-bit shift that preserves the carry of the addition, so the addition can be done before the shift, like this (not tested):

unsigned sum = dividend + mask;
unsigned result = __funnelshift_r(sum, sum < mask, log_2_of_divisor);
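
For concreteness, here is a hedged sketch of how the snippet might be packaged into a complete device function; the popcount-based log2 and the carry temporary are additions, not part of harold's two lines:

__device__ uint32_t div_by_power_of_2_rounding_up(uint32_t dividend, uint32_t divisor)
{
    uint32_t log_2_of_divisor = __popc(divisor - 1);        // log2(q) for a power-of-two divisor
    uint32_t mask             = divisor - 1;
    uint32_t sum              = dividend + mask;            // may wrap around...
    uint32_t carry            = (sum < mask) ? 1u : 0u;     // ...but the carry is recovered here
    return __funnelshift_r(sum, carry, log_2_of_divisor);   // 33-bit right shift of carry:sum
}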

Edit by @einpoklum:

Tested using @RobertCrovella's program, seems to work fine. The test kernel PTX for SM_61 is:

    .reg .pred      %p<2>;
    .reg .b32       %r<12>;


    ld.param.u32    %r5, [_Z4testjj_param_0];
    ld.param.u32    %r6, [_Z4testjj_param_1];
    neg.s32         %r7, %r6;
    and.b32         %r8, %r6, %r7;
    clz.b32         %r9, %r8;
    mov.u32         %r10, 31;
    sub.s32         %r4, %r10, %r9;
    add.s32         %r11, %r6, -1;
    add.s32         %r2, %r11, %r5;
    setp.lt.u32     %p1, %r2, %r11;
    selp.u32        %r3, 1, 0, %p1;
    // inline asm
    shf.r.wrap.b32 %r1, %r2, %r3, %r4;
    // inline asm
    st.global.u32   [r], %r1;
    ret;

and the SASS is:

/*0008*/                   MOV R1, c[0x0][0x20];                 /* 0x4c98078000870001 */
/*0010*/                   MOV R0, c[0x0][0x144];                /* 0x4c98078005170000 */
/*0018*/                   IADD R2, RZ, -c[0x0][0x144];          /* 0x4c1100000517ff02 */
                                                                 /* 0x001c4c00fe4007f1 */
/*0028*/                   IADD32I R0, R0, -0x1;                 /* 0x1c0ffffffff70000 */
/*0030*/                   LOP.AND R2, R2, c[0x0][0x144];        /* 0x4c47000005170202 */
/*0038*/                   FLO.U32 R2, R2;                       /* 0x5c30000000270002 */
                                                                 /* 0x003fd800fe2007e6 */
/*0048*/                   IADD R5, R0, c[0x0][0x140];           /* 0x4c10000005070005 */
/*0050*/                   ISETP.LT.U32.AND P0, PT, R5, R0, PT;  /* 0x5b62038000070507 */
/*0058*/                   IADD32I R0, -R2, 0x1f;                /* 0x1d00000001f70200 */
                                                                 /* 0x001fc400fe2007f6 */
/*0068*/                   IADD32I R0, -R0, 0x1f;                /* 0x1d00000001f70000 */
/*0070*/                   SEL R6, RZ, 0x1, !P0;                 /* 0x38a004000017ff06 */
/*0078*/                   MOV32I R2, 0x0;                       /* 0x010000000007f002 */
                                                                 /* 0x0003c400fe4007e4 */
/*0088*/                   MOV32I R3, 0x0;                       /* 0x010000000007f003 */
/*0090*/                   SHF.R.W R0, R5, R0, R6;               /* 0x5cfc030000070500 */
/*0098*/                   STG.E [R2], R0;                       /* 0xeedc200000070200 */
                                                                 /* 0x001f8000ffe007ff */
/*00a8*/                   EXIT;                                 /* 0xe30000000007000f */
/*00b0*/                   BRA 0xb0;                             /* 0xe2400fffff87000f */
/*00b8*/                   NOP;                                  /* 0x50b0000000070f00 */
harold
  • Oh, this is awesome! I completely forgot about "funnel shift", since when I had noticed it before I didn't have anything interesting to do with it. Could you possibly link to other noteworthy uses of it? – einpoklum Apr 22 '17 at 22:16
  • The same strategy can also be extended to 64 bits. Seems to require different code paths for shifts <=32 bit and > 32 bits though. – tera Apr 22 '17 at 22:17
  • @tera what's your strategy? I'm not really seeing a nice way to extend it – harold Apr 22 '17 at 22:23
  • Just use two funnel shifts (one for the hi word, one for the lo word). The two code paths then come from the fact that `__funnelshift_r()` limits the shift to <=31 bits. – tera Apr 22 '17 at 22:29
  • @einpoklum I don't have any links, the only uses of funnelshift I know of are rotations and big-int shifting (which this is an instance of, and so is using it to manipulate bit-streams). I'm across the road from you btw, at UvA. – harold Apr 23 '17 at 20:29
  • How about we chat in real life then? My profile is linked to my GitHub profile so you can pretty easily figure out my name and email... – einpoklum Apr 23 '17 at 20:40
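
A hedged, untested sketch of the 64-bit extension tera outlines in the comments above (two funnel shifts, with separate paths for shift counts below and at-or-above 32, since __funnelshift_r() only honours shift amounts 0..31); the helper name is an assumption, not tera's code:

__device__ uint64_t div_by_power_of_2_rounding_up_64(uint64_t p, uint64_t q)
{
    uint32_t s     = __popcll(q - 1);               // log2(q) for a power-of-two divisor
    uint64_t sum   = p + (q - 1);                   // may wrap around...
    uint32_t carry = (sum < q - 1) ? 1u : 0u;       // ...the carry is the 65th bit of the addition
    uint32_t lo    = (uint32_t) sum;
    uint32_t hi    = (uint32_t)(sum >> 32);
    uint32_t rlo, rhi;
    if (s < 32) {                                   // shift carry:hi:lo (65 significant bits) right by s
        rlo = __funnelshift_r(lo, hi, s);
        rhi = __funnelshift_r(hi, carry, s);
    } else {                                        // 32 <= s <= 63: the low word is shifted out entirely
        rlo = __funnelshift_r(hi, carry, s - 32);
        rhi = carry >> (s - 32);
    }
    return ((uint64_t)rhi << 32) | rlo;
}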

Here's an adaptation of a well-performing answer for the CPU:

template <typename T>
__device__ T div_by_power_of_2_rounding_up(T dividend, T divisor)
{
    auto log_2_of_divisor = lg(divisor);
    auto mask = divisor - 1;
    // (dividend & mask) is the remainder; adding mask and shifting right by
    // log_2_of_divisor yields 1 exactly when that remainder is non-zero.
    auto correction_for_rounding_up = ((dividend & mask) + mask) >> log_2_of_divisor;

    return (dividend >> log_2_of_divisor) + correction_for_rounding_up;
}

I wonder whether one can do much better.

The SASS code (using @RobertCrovella's test kernel) for SM_61 is:

code for sm_61
        Function : test(unsigned int, unsigned int)
.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                           /* 0x001fd400fe2007f6 */
/*0008*/                   MOV R1, c[0x0][0x20];           /* 0x4c98078000870001 */
/*0010*/                   IADD R0, RZ, -c[0x0][0x144];    /* 0x4c1100000517ff00 */
/*0018*/                   MOV R2, c[0x0][0x144];          /* 0x4c98078005170002 */
                                                           /* 0x003fc40007a007f2 */
/*0028*/                   LOP.AND R0, R0, c[0x0][0x144];  /* 0x4c47000005170000 */
/*0030*/                   FLO.U32 R3, R0;                 /* 0x5c30000000070003 */
/*0038*/                   IADD32I R0, R2, -0x1;           /* 0x1c0ffffffff70200 */
                                                           /* 0x001fc400fcc017f5 */
/*0048*/                   IADD32I R3, -R3, 0x1f;          /* 0x1d00000001f70303 */
/*0050*/                   LOP.AND R2, R0, c[0x0][0x140];  /* 0x4c47000005070002 */
/*0058*/                   IADD R2, R0, R2;                /* 0x5c10000000270002 */
                                                           /* 0x001fd000fe2007f1 */
/*0068*/                   IADD32I R0, -R3, 0x1f;          /* 0x1d00000001f70300 */
/*0070*/                   MOV R3, c[0x0][0x140];          /* 0x4c98078005070003 */
/*0078*/                   MOV32I R6, 0x0;                 /* 0x010000000007f006 */
                                                           /* 0x001fc400fc2407f1 */
/*0088*/                   SHR.U32 R4, R2, R0.reuse;       /* 0x5c28000000070204 */
/*0090*/                   SHR.U32 R5, R3, R0;             /* 0x5c28000000070305 */
/*0098*/                   MOV R2, R6;                     /* 0x5c98078000670002 */
                                                           /* 0x0003c400fe4007f4 */
/*00a8*/                   MOV32I R3, 0x0;                 /* 0x010000000007f003 */
/*00b0*/                   IADD R0, R4, R5;                /* 0x5c10000000570400 */
/*00b8*/                   STG.E [R2], R0;                 /* 0xeedc200000070200 */
                                                           /* 0x001f8000ffe007ff */
/*00c8*/                   EXIT;                           /* 0xe30000000007000f */
/*00d0*/                   BRA 0xd0;                       /* 0xe2400fffff87000f */
/*00d8*/                   NOP;                            /* 0x50b0000000070f00 */
                                                           /* 0x001f8000fc0007e0 */
/*00e8*/                   NOP;                            /* 0x50b0000000070f00 */
/*00f0*/                   NOP;                            /* 0x50b0000000070f00 */
/*00f8*/                   NOP;                            /* 0x50b0000000070f00 */

with FLO being the "find leading one" instruction (thanks @tera). Anyway, that's a lot of instructions, even if you ignore the loads from (what looks like) constant memory... the CPU function that inspired this one compiles into just:

    tzcnt   rax, rsi
    lea     rcx, [rdi - 1]
    shrx    rax, rcx, rax
    add     rax, 1
    test    rdi, rdi
    cmove   rax, rdi

(with clang 3.9.0).

einpoklum
template <typename T> __device__ T div_by_power_of_2_rounding_up(T p, T q)
{
    return p==0 ? 0 : ((p - 1) >> lg(q)) + 1;
}

One instruction shorter than Robert's previous answer (but see his comeback) if my count is correct, or the same instruction count as the funnel shift. It has a branch though; I'm not sure whether that makes a difference (other than being a benefit if an entire warp gets zero p inputs):

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

    code for sm_61
        Function : _Z4testjj
    .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                                         /* 0x001fc000fda007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                                 /* 0x4c98078000870001 */
        /*0010*/                   ISETP.EQ.AND P0, PT, RZ, c[0x0][0x140], PT;           /* 0x4b6503800507ff07 */
        /*0018*/         {         MOV R0, RZ;                                           /* 0x5c9807800ff70000 */
        /*0028*/               @P0 BRA 0x90;        }                                    /* 0x001fc800fec007fd */
                                                                                         /* 0xe24000000600000f */
        /*0030*/                   IADD R0, RZ, -c[0x0][0x144];                          /* 0x4c1100000517ff00 */
        /*0038*/                   LOP.AND R0, R0, c[0x0][0x144];                        /* 0x4c47000005170000 */
                                                                                         /* 0x003fc400ffa00711 */
        /*0048*/                   FLO.U32 R0, R0;                                       /* 0x5c30000000070000 */
        /*0050*/                   MOV R3, c[0x0][0x140];                                /* 0x4c98078005070003 */
        /*0058*/                   IADD32I R2, -R0, 0x1f;                                /* 0x1d00000001f70002 */
                                                                                         /* 0x001fd800fcc007f5 */
        /*0068*/                   IADD32I R0, R3, -0x1;                                 /* 0x1c0ffffffff70300 */
        /*0070*/                   IADD32I R2, -R2, 0x1f;                                /* 0x1d00000001f70202 */
        /*0078*/                   SHR.U32 R0, R0, R2;                                   /* 0x5c28000000270000 */
                                                                                         /* 0x001fc800fe2007f6 */
        /*0088*/                   IADD32I R0, R0, 0x1;                                  /* 0x1c00000000170000 */
        /*0090*/                   MOV32I R2, 0x0;                                       /* 0x010000000007f002 */
        /*0098*/                   MOV32I R3, 0x0;                                       /* 0x010000000007f003 */
                                                                                         /* 0x001ffc00ffe000f1 */
        /*00a8*/                   STG.E [R2], R0;                                       /* 0xeedc200000070200 */
        /*00b0*/                   EXIT;                                                 /* 0xe30000000007000f */
        /*00b8*/                   BRA 0xb8;                                             /* 0xe2400fffff87000f */
        ..........................

I believe it should still be possible to shave an instruction or two from the funnel shift by writing it in PTX (Morning update: as Robert has proven in the meantime), but I really need to go to bed.

Update2: Doing that (using Harold's funnel shift and writing the function in PTX)

__device__ uint32_t div_by_power_of_2_rounding_up(uint32_t p, uint32_t q)
{
  uint32_t ret;
  asm volatile("{\n\t"
               ".reg .u32       shift, mask, lo, hi;\n\t"
               "bfind.u32       shift, %2;\n\t"
               "sub.u32         mask, %2, 1;\n\t"
               "add.cc.u32      lo, %1, mask;\n\t"
               "addc.u32        hi, 0, 0;\n\t"
               "shf.r.wrap.b32  %0, lo, hi, shift;\n\t"
               "}"
                : "=r"(ret) : "r"(p), "r"(q));
  return ret;
}

just gets us to the same instruction count as Robert has already achieved with his simpler C code:

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

    code for sm_61
        Function : _Z4testjj
    .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                            /* 0x001fc000fec007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                    /* 0x4c98078000870001 */
        /*0010*/                   MOV R0, c[0x0][0x144];                   /* 0x4c98078005170000 */
        /*0018*/         {         IADD32I R2, R0, -0x1;                    /* 0x1c0ffffffff70002 */
        /*0028*/                   FLO.U32 R0, c[0x0][0x144];        }      /* 0x001fc400fec00716 */
                                                                            /* 0x4c30000005170000 */
        /*0030*/                   IADD R5.CC, R2, c[0x0][0x140];           /* 0x4c10800005070205 */
        /*0038*/                   IADD.X R6, RZ, RZ;                       /* 0x5c1008000ff7ff06 */
                                                                            /* 0x003fc800fc8007f1 */
        /*0048*/                   MOV32I R2, 0x0;                          /* 0x010000000007f002 */
        /*0050*/                   MOV32I R3, 0x0;                          /* 0x010000000007f003 */
        /*0058*/                   SHF.R.W R0, R5, R0, R6;                  /* 0x5cfc030000070500 */
                                                                            /* 0x001ffc00ffe000f1 */
        /*0068*/                   STG.E [R2], R0;                          /* 0xeedc200000070200 */
        /*0070*/                   EXIT;                                    /* 0xe30000000007000f */
        /*0078*/                   BRA 0x78;                                /* 0xe2400fffff87000f */
        ..........................
tera

riffing off of the kewl answer by @tera:

 template <typename T> __device__ T pdqru(T p, T q)
{
    return bool(p) * (((p - 1) >> lg(q)) + 1);
}

11 instructions (no branches, no predication) to get the result in R0:

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_61
                Function : _Z4testjj
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                   /* 0x001fc800fec007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];           /* 0x4c98078000870001 */
        /*0010*/                   IADD R0, RZ, -c[0x0][0x144];    /* 0x4c1100000517ff00 */
        /*0018*/                   LOP.AND R0, R0, c[0x0][0x144];  /* 0x4c47000005170000 */
                                                                   /* 0x003fc400ffa00711 */
        /*0028*/                   FLO.U32 R0, R0;                 /* 0x5c30000000070000 */
        /*0030*/                   MOV R5, c[0x0][0x140];          /* 0x4c98078005070005 */
        /*0038*/                   IADD32I R2, -R0, 0x1f;          /* 0x1d00000001f70002 */
                                                                   /* 0x001fd800fcc007f5 */
        /*0048*/                   IADD32I R0, R5, -0x1;           /* 0x1c0ffffffff70500 */
        /*0050*/                   IADD32I R2, -R2, 0x1f;          /* 0x1d00000001f70202 */
        /*0058*/                   SHR.U32 R0, R0, R2;             /* 0x5c28000000270000 */
                                                                   /* 0x001fd000fe2007f1 */
        /*0068*/                   IADD32I R0, R0, 0x1;            /* 0x1c00000000170000 */
        /*0070*/                   MOV32I R2, 0x0;                 /* 0x010000000007f002 */
        /*0078*/                   MOV32I R3, 0x0;                 /* 0x010000000007f003 */
                                                                   /* 0x001ffc001e2007f2 */
        /*0088*/                   ICMP.NE R0, R0, RZ, R5;         /* 0x5b4b02800ff70000 */
        /*0090*/                   STG.E [R2], R0;                 /* 0xeedc200000070200 */
        /*0098*/                   EXIT;                           /* 0xe30000000007000f */
                                                                   /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0;                       /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                            /* 0x50b0000000070f00 */
                ..........................

After studying the above SASS code, it seemed evident that these two instructions:

        /*0038*/                   IADD32I R2, -R0, 0x1f;          /* 0x1d00000001f70002 */
                                                                   /* 0x001fd800fcc007f5 */
        ...
        /*0050*/                   IADD32I R2, -R2, 0x1f;          /* 0x1d00000001f70202 */

shouldn't really be necessary. I don't have a precise explanation, but my assumption is that because the FLO.U32 SASS instruction does not have precisely the same semantics as the __ffs() intrinsic, the compiler emits an idiom around that intrinsic, wrapping the basic FLO instruction that does the actual work. It wasn't obvious how to work around this at the C++ source code level, but I was able to use the bfind PTX instruction in a way that reduces the instruction count further, to 7 by my count (to get the answer into a register):

$ cat t107.cu
#include <cstdio>
#include <cstdint>
__device__ unsigned r = 0;


static __device__ __inline__ uint32_t __my_bfind(uint32_t val){
  uint32_t ret;
  asm volatile("bfind.u32 %0, %1;" : "=r"(ret): "r"(val));
  return ret;}

template <typename T> __device__ T pdqru(T p, T q)
{
    return bool(p) * (((p - 1) >> (__my_bfind(q))) + 1);
}

__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
  unsigned q2 = 16;
  unsigned z = 0;
  unsigned l = 1U<<31;
  printf("result %u/%u = %u\n", p, q, pdqru(p, q));
  printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
  printf("result %u/%u = %u\n", p, z, pdqru(p, z));
  printf("result %u/%u = %u\n", z, q, pdqru(z, q));
  printf("result %u/%u = %u\n", l, q, pdqru(l, q));
  printf("result %u/%u = %u\n", q, l, pdqru(q, l));
  printf("result %u/%u = %u\n", l, l, pdqru(l, l));
  printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
  r = pdqru(p, q);
#endif
}


int main(){
  unsigned h_r;
  test<<<1,1>>>(32767, 32);
  cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
  printf("result = %u\n", h_r);
}


$ nvcc -arch=sm_61 -o t107 t107.cu -std=c++11
$ cuobjdump -sass t107

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_61

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_61
                Function : _Z4testjj
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                        /* 0x001c4400fe0007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                /* 0x4c98078000870001 */
        /*0010*/         {         MOV32I R3, 0x0;                      /* 0x010000000007f003 */
        /*0018*/                   FLO.U32 R2, c[0x0][0x144];        }  /* 0x4c30000005170002 */
                                                                        /* 0x003fd800fec007f6 */
        /*0028*/                   MOV R5, c[0x0][0x140];               /* 0x4c98078005070005 */
        /*0030*/                   IADD32I R0, R5, -0x1;                /* 0x1c0ffffffff70500 */
        /*0038*/                   SHR.U32 R0, R0, R2;                  /* 0x5c28000000270000 */
                                                                        /* 0x001fc800fca007f1 */
        /*0048*/                   IADD32I R0, R0, 0x1;                 /* 0x1c00000000170000 */
        /*0050*/                   MOV32I R2, 0x0;                      /* 0x010000000007f002 */
        /*0058*/                   ICMP.NE R0, R0, RZ, R5;              /* 0x5b4b02800ff70000 */
                                                                        /* 0x001ffc00ffe000f1 */
        /*0068*/                   STG.E [R2], R0;                      /* 0xeedc200000070200 */
        /*0070*/                   EXIT;                                /* 0xe30000000007000f */
        /*0078*/                   BRA 0x78;                            /* 0xe2400fffff87000f */
                ..........................



Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvcc -arch=sm_61 -o t107 t107.cu -std=c++11 -DUSE_DISPLAY
$ cuda-memcheck ./t107
========= CUDA-MEMCHECK
result 32767/32 = 1024
result 32767/16 = 2048
result 32767/0 = 1
result 0/32 = 0
result 2147483648/32 = 67108864
result 32/2147483648 = 1
result 2147483648/2147483648 = 1
result 32/32 = 1
result = 0
========= ERROR SUMMARY: 0 errors
$

I've only demonstrated the 32-bit example, above.

I think I could make the case that there are really only 6 instructions doing the "work" in the above kernel SASS, and that the remainder of the instructions are kernel "overhead" and/or the instructions needed to store the register result into global memory. It seems evident that the compiler is generating just these instructions as a result of the function:

        /*0018*/                   FLO.U32 R2, c[0x0][0x144];  // find position of the set bit in q (= log2(q))
                                                                        /*  */
        /*0028*/                   MOV R5, c[0x0][0x140];      // load p
        /*0030*/                   IADD32I R0, R5, -0x1;       // subtract 1 from p
        /*0038*/                   SHR.U32 R0, R0, R2;         // shift (p-1) right by log2(q) bits
                                                                        /*  */
        /*0048*/                   IADD32I R0, R0, 0x1;        // add 1 to result
        /*0050*/                   ...                                  /*  */
        /*0058*/                   ICMP.NE R0, R0, RZ, R5;     // account for p=0 case

However this would be inconsistent with the way I've counted other cases (they should all probably be reduced by 1).

Robert Crovella
  • `FLO` isn't supposed to have the same semantics as `__ffs`: `__ffs` counts _trailing_ zeros, not _leading_ zeros; you probably meant `__clz`... in the PTX (for my version) we actually have the equivalent of `clz(q & ~p)` which is weird in itself. – einpoklum Apr 23 '17 at 09:23
  • I didn't say anywhere that FLO is supposed to have the same semantics as `__ffs` and I said my assumption was they were different. `FLO` is find leading one, and I was looking for the PTX instruction that would do that, and it appeared to be `bfind`. I assumed that the reason that the `__ffs()` was incorporating additional unneeded instructions was that it was due to the difference in semantics between `FLO` and `__ffs()`. The compiler doesn't seem to strip out the unneeded instructions, even though they seem unnecessary, and I haven't determined the exact reason for that. – Robert Crovella Apr 23 '17 at 14:10

One possible straightforward approach:

$ cat t105.cu
#include <cstdio>

__device__ unsigned r = 0;

template <typename T>
__device__ T pdqru(T p, T q){

  T p1 = p +  (q-1);
  if (sizeof(T) == 8)
    q = __ffsll(q);
  else
    q = __ffs(q);
  return (p1<p)?((p>>(q-1))+1) :(p1 >> (q-1));
}

__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
  unsigned q2 = 16;
  unsigned z = 0;
  unsigned l = 1U<<31;
  printf("result %u/%u = %u\n", p, q, pdqru(p, q));
  printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
  printf("result %u/%u = %u\n", p, z, pdqru(p, z));
  printf("result %u/%u = %u\n", z, q, pdqru(z, q));
  printf("result %u/%u = %u\n", l, q, pdqru(l, q));
  printf("result %u/%u = %u\n", q, l, pdqru(q, l));
  printf("result %u/%u = %u\n", l, l, pdqru(l, l));
  printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
  r = pdqru(p, q);
#endif
}


int main(){
  unsigned h_r;
  test<<<1,1>>>(32767, 32);
  cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
  printf("result = %u\n", h_r);
}


$ nvcc -arch=sm_61 -o t105 t105.cu
$ cuobjdump -sass ./t105

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_61

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_61
                Function : _Z4testjj
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                         /* 0x001fc800fec007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                 /* 0x4c98078000870001 */
        /*0010*/                   IADD R0, RZ, -c[0x0][0x144];          /* 0x4c1100000517ff00 */
        /*0018*/                   LOP.AND R0, R0, c[0x0][0x144];        /* 0x4c47000005170000 */
                                                                         /* 0x005fd401fe20003d */
        /*0028*/                   FLO.U32 R2, R0;                       /* 0x5c30000000070002 */
        /*0030*/                   MOV R0, c[0x0][0x144];                /* 0x4c98078005170000 */
        /*0038*/                   IADD32I R3, -R2, 0x1f;                /* 0x1d00000001f70203 */
                                                                         /* 0x001fd000fc2007f1 */
        /*0048*/                   IADD32I R0, R0, -0x1;                 /* 0x1c0ffffffff70000 */
        /*0050*/                   MOV R2, c[0x0][0x140];                /* 0x4c98078005070002 */
        /*0058*/                   IADD32I R4, -R3, 0x1f;                /* 0x1d00000001f70304 */
                                                                         /* 0x001fd800fe2007f6 */
        /*0068*/                   IADD R5, R0, c[0x0][0x140];           /* 0x4c10000005070005 */
        /*0070*/                   ISETP.LT.U32.AND P0, PT, R5, R0, PT;  /* 0x5b62038000070507 */
        /*0078*/                   SHR.U32 R0, R2, R4;                   /* 0x5c28000000470200 */
                                                                         /* 0x001fd000fc2007f1 */
        /*0088*/                   IADD32I R0, R0, 0x1;                  /* 0x1c00000000170000 */
        /*0090*/                   MOV32I R2, 0x0;                       /* 0x010000000007f002 */
        /*0098*/                   MOV32I R3, 0x0;                       /* 0x010000000007f003 */
                                                                         /* 0x001ffc001e2007f2 */
        /*00a8*/              @!P0 SHR.U32 R0, R5, R4;                   /* 0x5c28000000480500 */
        /*00b0*/                   STG.E [R2], R0;                       /* 0xeedc200000070200 */
        /*00b8*/                   EXIT;                                 /* 0xe30000000007000f */
                                                                         /* 0x001f8000fc0007ff */
        /*00c8*/                   BRA 0xc0;                             /* 0xe2400fffff07000f */
        /*00d0*/                   NOP;                                  /* 0x50b0000000070f00 */
        /*00d8*/                   NOP;                                  /* 0x50b0000000070f00 */
                                                                         /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                  /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                  /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                  /* 0x50b0000000070f00 */
                ..........................



Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvcc -arch=sm_61 -o t105 t105.cu -DUSE_DISPLAY
$ cuda-memcheck ./t105
========= CUDA-MEMCHECK
result 32767/32 = 1024
result 32767/16 = 2048
result 32767/0 = 2048
result 0/32 = 0
result 2147483648/32 = 67108864
result 32/2147483648 = 1
result 2147483648/2147483648 = 1
result 32/32 = 1
result = 0
========= ERROR SUMMARY: 0 errors
$

Approximately 14 SASS instructions for the 32-bit case, to get the answer into R0. It produces spurious results for the divide-by-zero case.

The equivalent assembly for this answer case looks like this:

$ cat t106.cu
#include <cstdio>
#include <cstdint>
__device__ unsigned r = 0;


template <typename T> __device__ int find_first_set(T x);
template <> __device__ int find_first_set<uint32_t>(uint32_t x) { return __ffs(x);   }
template <> __device__ int find_first_set<uint64_t>(uint64_t x) { return __ffsll(x); }

template <typename T>  __device__ T lg(T x) { return find_first_set(x) - 1; }

template <typename T>
__device__ T pdqru(T dividend, T divisor)
{
    auto log_2_of_divisor = lg(divisor);
    auto mask = divisor - 1;
    auto correction_for_rounding_up = ((dividend & mask) + mask) >> log_2_of_divisor;

    return (dividend >> log_2_of_divisor) + correction_for_rounding_up;
}

__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
  unsigned q2 = 16;
  unsigned z = 0;
  unsigned l = 1U<<31;
  printf("result %u/%u = %u\n", p, q, pdqru(p, q));
  printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
  printf("result %u/%u = %u\n", p, z, pdqru(p, z));
  printf("result %u/%u = %u\n", z, q, pdqru(z, q));
  printf("result %u/%u = %u\n", l, q, pdqru(l, q));
  printf("result %u/%u = %u\n", q, l, pdqru(q, l));
  printf("result %u/%u = %u\n", l, l, pdqru(l, l));
  printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
  r = pdqru(p, q);
#endif
}


int main(){
  unsigned h_r;
  test<<<1,1>>>(32767, 32);
  cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
  printf("result = %u\n", h_r);
}


$ nvcc -std=c++11  -arch=sm_61 -o t106 t106.cu
$ cuobjdump -sass t106

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_61

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_61
                Function : _Z4testjj
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                   /* 0x001fd400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];           /* 0x4c98078000870001 */
        /*0010*/                   IADD R0, RZ, -c[0x0][0x144];    /* 0x4c1100000517ff00 */
        /*0018*/                   MOV R2, c[0x0][0x144];          /* 0x4c98078005170002 */
                                                                   /* 0x003fc40007a007f2 */
        /*0028*/                   LOP.AND R0, R0, c[0x0][0x144];  /* 0x4c47000005170000 */
        /*0030*/                   FLO.U32 R3, R0;                 /* 0x5c30000000070003 */
        /*0038*/                   IADD32I R0, R2, -0x1;           /* 0x1c0ffffffff70200 */
                                                                   /* 0x001fc400fcc017f5 */
        /*0048*/                   IADD32I R3, -R3, 0x1f;          /* 0x1d00000001f70303 */
        /*0050*/                   LOP.AND R2, R0, c[0x0][0x140];  /* 0x4c47000005070002 */
        /*0058*/                   IADD R2, R0, R2;                /* 0x5c10000000270002 */
                                                                   /* 0x001fd000fe2007f1 */
        /*0068*/                   IADD32I R0, -R3, 0x1f;          /* 0x1d00000001f70300 */
        /*0070*/                   MOV R3, c[0x0][0x140];          /* 0x4c98078005070003 */
        /*0078*/                   MOV32I R6, 0x0;                 /* 0x010000000007f006 */
                                                                   /* 0x001fc400fc2407f1 */
        /*0088*/                   SHR.U32 R4, R2, R0.reuse;       /* 0x5c28000000070204 */
        /*0090*/                   SHR.U32 R5, R3, R0;             /* 0x5c28000000070305 */
        /*0098*/                   MOV R2, R6;                     /* 0x5c98078000670002 */
                                                                   /* 0x0003c400fe4007f4 */
        /*00a8*/                   MOV32I R3, 0x0;                 /* 0x010000000007f003 */
        /*00b0*/                   IADD R0, R4, R5;                /* 0x5c10000000570400 */
        /*00b8*/                   STG.E [R2], R0;                 /* 0xeedc200000070200 */
                                                                   /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT;                           /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0;                       /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                            /* 0x50b0000000070f00 */
                                                                   /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                            /* 0x50b0000000070f00 */
                ..........................



Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

which appears to be 1 instruction longer, by my count.

Robert Crovella
  • It wasn't an along-the-way edit - it's the whole point of the question (and of the original question); since otherwise, indeed, the straightforward approach comes to mind relatively easily, and it doesn't seem like there could be overly interesting alternatives. – einpoklum Apr 22 '17 at 22:08
  • I've modified it to handle overflow. It adds a few instructions. – Robert Crovella Apr 22 '17 at 23:02
  • The SASS instructions count for the 32 bit case is basically the same as your implementation. – Robert Crovella Apr 22 '17 at 23:09

Here is an alternative solution via population count. I tried the 32-bit variant only, testing it exhaustively against the reference implementation. Since the divisor q is a power of 2, we can trivially determine the shift count s with the help of the population count operation. The remainder t of the truncating division can be computed with a simple mask m derived directly from the divisor q.

// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t reference (uint32_t p, uint32_t q)
{
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}

// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t solution (uint32_t p, uint32_t q)
{
    uint32_t r, s, t, m;
    m = q - 1;
    s = __popc (m);
    r = p >> s;
    t = p & m;
    if (t > 0) r++;
    return r;
}
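
A sketch of the kind of exhaustive check this implies, in case someone wants to reproduce it; the harness below is an assumption (the kernel name, grid shape, and tested range are not njuffa's), and it only presumes solution() and reference() above live in the same file:

#include <cstdio>
#include <cstdint>

__device__ unsigned long long mismatches = 0;

// Grid-stride loop over a range of dividends; for each one, try all 32
// power-of-two divisors and count disagreements between the two functions.
__global__ void check(uint32_t p_begin, uint32_t p_count)
{
    for (uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
         i < p_count;
         i += (uint64_t)gridDim.x * blockDim.x) {
        uint32_t p = p_begin + (uint32_t)i;
        for (int s = 0; s < 32; s++) {
            uint32_t q = 1u << s;
            if (solution(p, q) != reference(p, q))
                atomicAdd(&mismatches, 1ULL);
        }
    }
}

int main(){
    check<<<256, 256>>>(0, 1u << 24);   // one slice of the full 2^32 range of p
    unsigned long long h_mismatches;
    cudaMemcpyFromSymbol(&h_mismatches, mismatches, sizeof(h_mismatches));
    printf("mismatches: %llu\n", h_mismatches);
}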

Whether solution() is faster than the previously posted codes will likely depend on the specific GPU architecture. Using CUDA 8.0, it compiles to the following sequence of PTX instructions:

add.s32         %r3, %r2, -1;
popc.b32        %r4, %r3;
shr.u32         %r5, %r1, %r4;
and.b32         %r6, %r3, %r1;
setp.ne.s32     %p1, %r6, 0;
selp.u32        %r7, 1, 0, %p1;
add.s32         %r8, %r5, %r7;

For sm_5x, this translates into machine code pretty much 1:1, except that the two instructions SETP and SELP get contracted into a single ICMP, because the comparison is with 0.

njuffa
  • I don't either... I only have a 5.2, which I'll probably use next week for some performance comparisons. but you could still post the PTX and/or the SASS, with Robert's kernel, so that we see what it looks like. – einpoklum Apr 23 '17 at 22:56
  • Oh, wait, you've essentially just replaced the `lg()` implementation, and other than that you're doing the more-straightforward version of the round-up-correction. Now that I think about it, maybe I should just replace the `lg()` with your popcnt-based code, It's almost certainly better. – einpoklum Apr 23 '17 at 22:59
  • @einpoklum I added the generated PTX code, per your suggestion. – njuffa Apr 23 '17 at 23:56