23

At some point in my program I compute an integer divisor d. From that point onward d is going to be constant.

Later in the code I will divide by that d several times - performing an integer division, since the value of d is not a compile-time known constant.

Given that integer division is a relatively slow process compared to other kind of integer arithmetic, I would like to optimize it. Is there some alternative format that I could store d in, so that the division process would perform faster? Maybe a reciprocal of some form?

I do not need the value of d for anything else.

The value of d is any 64-bit integer, but usually fits in 32-bit quite well.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
CygnusX1
  • 20,968
  • 5
  • 65
  • 109
  • You may implement your own bitshift optimizer; and that's the only kind of constant optimization I know. – Tatsuyuki Ishi Jul 27 '17 at 14:26
  • 1
    My initial thoughts are centred around division being necessarily slow, and it's unlikely you'll beat the method your CPU uses. I think, in other words, you just have to suck it. Upvoted for intrigue though. – Bathsheba Jul 27 '17 at 14:27
  • 17
    Is this "slow" division really your bottleneck? – François Andrieux Jul 27 '17 at 14:27
  • 2
    do nothing. integer division takes few dozen CPU cycles anyway. you're optimizing nanoseconds here. – David Haim Jul 27 '17 at 14:29
  • 2
    @DavidHaim do nothing is not an attitude. In my area, for example, nanoseconds count. We measure responce time in nanos. – SergeyA Jul 27 '17 at 14:34
  • @SergeyA: In such cases you'd probably look to avoiding the division in the first place, and scale up the problem accordingly, if you get my meaning, especially if you can stick to the current type. – Bathsheba Jul 27 '17 at 14:35
  • @CygnusX1 Hardware specs / used c++ implementation etc. would be interesting though – clickMe Jul 27 '17 at 14:36
  • 1
    @Bathsheba, I certainly do. But sometimes you have to divide, because you have external data which needs to be scaled, but you have to deal with other, unscaled data as well. – SergeyA Jul 27 '17 at 14:36
  • 1
    @FrançoisAndrieux I don't think that's unlikely. 64bit division on x64 takes almost two orders or magnitude longer than any reasonable instruction. – harold Jul 27 '17 at 14:40
  • @harold Not all programs are CPU bound. If the program does even light I/O the division may be noise. – Mark B Jul 27 '17 at 14:41
  • Pure theoretical optimization is useless in this case. If you need to optimize concrete code show it, otherwise this is offtopic IMHO. If somebody would know faster generic way to do integer division why they would not implement it in CPU? – Slava Jul 27 '17 at 14:43
  • 8
    @MarkB, sure. But why the knee-jerk reaction implying OP doesn't know what they are doing? Can we at least give OP the benefit of the doubt? – SergeyA Jul 27 '17 at 14:44
  • 5
    Honestly I think these comments once again show that a performance question shouldn't be tagged with a language tag, that always brings out the "optimization is pointless"-crowd. Let this be a lesson for OP.. – harold Jul 27 '17 at 14:47
  • 1
    @harold, I am also puzzled by reaction. People who do not worry about performance shouldn't even use C++, they should code in Python... – SergeyA Jul 27 '17 at 14:48
  • @Slava because it would take either enough pre-processing to negate any savings, or an unreasonable large lookup table. Doing it in software allows you to pre-process only *once* because you know you will reuse it. – harold Jul 27 '17 at 14:53
  • 1
    @SergeyA they mostly code in Java, JS, python and C# ... ... the problem with them is, that they are actually productive, so in the end I happen to use their SW for most of the day, crying every time, watching as a bit(\*) beefed up text editor (IDE) needs 3GB of RAM... (\* OK, it's actually quite smart text editor... but still the 1980-1990 coders would probably have heart attack, if somebody would come from future and show them...) – Ped7g Jul 27 '17 at 14:53
  • There are 2 answers and according to "Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam." both of them offtopic. So I was right this question is offtopic indeed. – Slava Jul 27 '17 at 15:09
  • 1
    @Slava: you're wrong. I could give a better answer, which is actually an algorithm, or a code snippet to do this. Or even I could copy-paste a relevant code snippet from libdivide. However, I don't have that much time now. – geza Jul 27 '17 at 15:52
  • @geza it is not matter of better or worse, questions that require recommendation of book or lib considered offtopic as they attract opinionated answers and spam. That what rule says. And we can clearly see here spam - about who should use C++ or not and about Java, JS and C# developers. – Slava Jul 27 '17 at 16:34
  • @Slava: I've added an example solution for this problem. As you see, it is not a recommendation of a book or lib. – geza Jul 27 '17 at 17:03
  • 4
    @Slava Questions *asking* for recommendations of books or libraries are off-topic. However, that does not mean that a question is off-topic just because it can be *answered* with a recommendation of a library. That's a perfectly valid answer to a question that asks how to do something. – Cody Gray - on strike Jul 28 '17 at 11:01

3 Answers3

32

There is a library for this—libdivide:

libdivide is an open source library for optimizing integer division

libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. Furthermore, libdivide allows you to divide an SSE2 vector by a runtime constant, which is especially nice because SSE2 has no integer division instructions!

libdivide is free and open source with a permissive license. The name "libdivide" is a bit of a joke, as there is no library per se: the code is packaged entirely as a single header file, with both a C and a C++ API.

You can read about the algorithm behind it at this blog; for example, this entry.

Basically, the algorithm behind it is the same one that compilers use to optimize division by a constant, except that it allows these strength-reduction optimizations to be done at run-time.

Note: you can create an even faster version of libdivide. The idea is that for every divisor, you can always create a triplet (mul/add/shift), so this expression gives the result: (num*mul+add)>>shift (multiply is a wide multiply here). Interestingly, this method could beat the compiler version for constant division for several microbenchmarks!


Here's my implementation (this is not compilable out of the box, but the general algorithm can be seen):

struct Divider_u32 {
    u32 mul;
    u32 add;
    s32 shift;

    void set(u32 divider);
};

void Divider_u32::set(u32 divider) {
    s32 l = indexOfMostSignificantBit(divider);
    if (divider&(divider-1)) {
        u64 m = static_cast<u64>(1)<<(l+32);
        mul = static_cast<u32>(m/divider);

        u32 rem = static_cast<u32>(m)-mul*divider;
        u32 e = divider-rem;

        if (e<static_cast<u32>(1)<<l) {
            mul++;
            add = 0;
        } else {
            add = mul;
        }
        shift = l;
    } else {
        if (divider==1) {
            mul = 0xffffffff;
            add = 0xffffffff;
            shift = 0;
        } else {
            mul = 0x80000000;
            add = 0;
            shift = l-1;
        }
    }
}

u32 operator/(u32 v, const Divider_u32 &div) {
    u32 t = static_cast<u32>((static_cast<u64>(v)*div.mul+div.add)>>32)>>div.shift;

    return t;
}
geza
  • 28,403
  • 6
  • 61
  • 135
  • Looks promising, but the normal rule: don't make it link only. – Tatsuyuki Ishi Jul 27 '17 at 14:38
  • What the deuce? If a compiler is going to use it anyway then what is the point in obfuscating your source code? – Bathsheba Jul 27 '17 at 14:44
  • @Bathsheba: because the compiler won't optimize it, as it is not a compile time constant. – geza Jul 27 '17 at 14:45
  • @geza: Ok, clearly this is above my pay grade. Downvote converted to upvote. Plus the answer is cuter now than it was. – Bathsheba Jul 27 '17 at 14:45
  • 3
    @CodyGray is running for moderator. Vote him in. – Bathsheba Jul 27 '17 at 14:47
  • 2
    You are very welcome! Thanks to both of you for the support! For what it's worth, I think libdivide is an excellent solution here. I saw it several years ago, and always wanted to find a good use for it because it looked like a neat idea. Excerpting the relevant bits from the link is a good, simple way to keep an answer from being link-only. – Cody Gray - on strike Jul 27 '17 at 14:55
  • @geza: Can you show an example for your claim that "*this method could beat the compiler version for constant division for several microbenchmarks!*"? Does it still happen with `-mtune=native`? If so, that's a missed-optimization bug. (Writing it out manually will probably compile more like `clang`, and use `imul r64,r64`. IIRC, gcc likes to use `mul r32` even in 64-bit mode, which is slower on recent Intel but faster on Atom and AMD Bulldozer-family). – Peter Cordes Jul 29 '17 at 07:17
  • @PeterCordes: I measured this about two years ago. The algorithm was the same as I written in my answer, but maybe it was the 64-bit version. I'm not saying that it was always faster. I created some microbenchmarks, and for several of them, this method was faster (it was basically dividing numbers in a loop, and summing the result). The difference wasn't huge, something like 10-20%. If you're interested in another missing optimization, look at this: https://stackoverflow.com/questions/45298870/why-does-loop-alignment-on-32-byte-make-code-faster. Removing volatile here makes the code slow down. – geza Jul 29 '17 at 08:40
  • @PeterCordes: this volatile thing could be a little bit similar as this case. Maybe, because with my version, everything was in a register, so it was faster, than a version, where some constants was embedded in the code (for the linked issue, this happens. Presumably, `"mov reg, const"` is slower than `"mov reg, reg"`) – geza Jul 29 '17 at 08:43
  • @geza: Anyway, your case with multiply sounds is probably a case of writing the same thing differently leading to a different code-gen strategy. You can report compiler bugs for things like that. In [gcc's bugzilla](https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc), use the missed-optimization keyword. – Peter Cordes Jul 29 '17 at 10:54
  • Assuming two's complement math, the two lines beginning with `rem = ` and `e =` could be replaced with `e = divider + static_cast(mul*divider);` . (Since the lower 32 bits of m == 0). – rcgldr Aug 01 '17 at 07:26
6

The book "Hacker's delight" has "Chapter 10: Integer division by constant" spanning 74 pages. You can find all the code examples for free in this directory: http://www.hackersdelight.org/hdcode.htm In your case, Figs. 10-1., 10-2 and 10-3 are what you want.

The problem of dividing by a constant d is equivalent to mutiplying by c = 1/d. These algorithms calculate such a constant for you. Once you have c, you calculate the dividend as such:

int divideByMyConstant(int dividend){
  int c = MAGIC; // Given by the algorithm

  // since 1/d < 1, c is actually (1<<k)/d to fit nicely ina 32 bit int
  int k = MAGIC_SHIFT; //Also given by the algorithm

  long long tmp = (long long)dividend * c; // use 64 bit to hold all the precision...

  tmp >>= k; // Manual floating point number =)

  return (int)tmp;
}
ZeroUltimax
  • 256
  • 1
  • 9
  • 1
    You need to cast before assignment. `long long tmp = div * c` will still evaluate `dividend * c` as `int`, and *then* sign-extend the `int` result to `long long`, so you lose the upper-half result. Also, you should probably be using unsigned. – Peter Cordes Jul 29 '17 at 07:27
  • You were correct about the cast. I made the edit. On the subject of "signedness", the book proposes two algorithms, one for signed and one for unsigned division. Thanks for the feedback =) – ZeroUltimax Jul 31 '17 at 13:57
  • Oh, so does `tmp >>= k` need to be an arithmetic shift? In practice most implementations will give you that, but the C++ standard just says that right-shifts of signed integers are implementation-dependent (i.e. they could be logical right shifts that shift in zeros instead of copies of the sign bit (2's complement arithmetic shift)... or whatever is needed to make an arithmetic shift on one's complement or sign/magnitude systems.) – Peter Cordes Jul 31 '17 at 17:29
  • 1
    You're right, it should be unsigned. I was cooking from memory, so my algorithm isn't exact. The magic number actually takes care of the signedness. I'll write a working example. – ZeroUltimax Aug 01 '17 at 15:13
  • That link is dead. Could someone give the value for `MAGIC` and `MAGIC_SHIFT` ? – Huy Le Sep 08 '22 at 09:22
2

update - in my original answer, I noted an algorithm mentioned in a prior thread for compiler generated code for divide by constant. The assembly code was written to match a document linked to from that prior thread. The compiler generated code involves slightly different sequences depending on the divisor.

In this situation, the divisor is not known until runtime, so a common algorithm is desired. The example in geza's answer shows a common algorithm, which could be inlined in assembly code with GCC, but Visual Studio doesn't support inline assembly in 64 bit mode. In the case of Visual Studio, there's a trade off between the extra code involved if using intrinsics, versus calling a function written in assembly. On my system (Intel 3770k 3.5ghz) I tried calling a single function that does |mul add adc shr|, and I also tried using a pointer to function to optionally use shorter sequences |mul shr| or |shr(1) mul shr| depending on the divisor, but this provided little or no gain, depending on the compiler. The main overhead in this case is the call (versus |mul add adc shr| ). Even with the call overhead, the sequence|call mul add adc shr ret| averaged about 4 times as fast as divide on my system.

Note that the linked to source code for libdivide in geza's answer does not have a common routine that can handle a divisor == 1. The libdivide common sequence is multiply, subtract, shift(1), add, shift, versus geza's example c++ sequence of multiply, add, adc, shift.


My original answer: the example code below uses the algorithm described in a prior thread.

Why does GCC use multiplication by a strange number in implementing integer division?

This is a link to the document mentioned in the other thread:

http://gmplib.org/~tege/divcnst-pldi94.pdf

The example code below is based on the pdf document and is meant for Visual Studio, using ml64 (64 bit assembler), running on Windows (64 bit OS). The code with labels main... and dcm... is the code to generate a preshift (rspre, number of trailing zero bits in divisor), multiplier, and postshift (rspost). The code with labels dct... is the code to test the method.

        includelib      msvcrtd
        includelib      oldnames

sw      equ     8                       ;size of word

        .data
arrd    dq      1                       ;array of test divisors
        dq      2
        dq      3
        dq      4
        dq      5
        dq      7
        dq      256
        dq      3*256
        dq      7*256
        dq      67116375
        dq      07fffffffffffffffh      ;max divisor
        dq      0
        .data?

        .code
        PUBLIC  main

main    PROC
        push    rbp
        push    rdi
        push    rsi
        sub     rsp,64                  ;allocate stack space
        mov     rbp,rsp
        lea     rsi,arrd                ;set ptr to array of divisors
        mov     [rbp+6*sw],rsi
        jmp     main1

main0:  mov     [rbp+0*sw],rbx          ;[rbp+0*sw] = rbx = divisor = d
        cmp     rbx,1                   ;if d <= 1, q=n or overflow
        jbe     main1
        bsf     rcx,rbx                 ;rcx = rspre
        mov     [rbp+1*sw],rcx          ;[rbp+1*sw] = rspre
        shr     rbx,cl                  ;rbx = d>>rsc
        bsr     rcx,rbx                 ;rcx = floor(log2(rbx))
        mov     rsi,1                   ;rsi = 2^floor(log2(rbx))
        shl     rsi,cl
        cmp     rsi,rbx                 ;br if power of 2
        je      dcm03
        inc     rcx                     ;rcx = ceil(log2(rcx)) = L = rspost
        shl     rsi,1                   ;rsi = 2^L
;       jz      main1                   ;d > 08000000000000000h, use compare
        add     rcx,[rbp+1*sw]          ;rcx = L+rspre
        cmp     rcx,8*sw                ;if d > 08000000000000000h, use compare
        jae     main1
        mov     rax,1                   ;[rbp+3*sw] = 2^(L+rspre)
        shl     rax,cl
        mov     [rbp+3*sw],rax
        sub     rcx,[rbp+1*sw]          ;rcx = L
        xor     rdx,rdx
        mov     rax,rsi                 ;hi N bits of 2^(N+L)
        div     rbx                     ;rax == 1
        xor     rax,rax                 ;lo N bits of 2^(N+L)
        div     rbx
        mov     rdi,rax                 ;rdi = mlo % 2^N
        xor     rdx,rdx
        mov     rax,rsi                 ;hi N bits of 2^(N+L) + 2^(L+rspre)
        div     rbx                     ;rax == 1
        mov     rax,[rbp+3*sw]          ;lo N bits of 2^(N+L) + 2^(L+rspre)
        div     rbx                     ;rax = mhi % 2^N
        mov     rdx,rdi                 ;rdx = mlo % 2^N
        mov     rbx,8*sw                ;rbx = e = # bits in word
dcm00:  mov     rsi,rdx                 ;rsi = mlo/2
        shr     rsi,1
        mov     rdi,rax                 ;rdi = mhi/2
        shr     rdi,1
        cmp     rsi,rdi                 ;break if mlo >= mhi
        jae     short dcm01
        mov     rdx,rsi                 ;rdx = mlo/2
        mov     rax,rdi                 ;rax = mhi/2
        dec     rbx                     ;e -= 1
        loop    dcm00                   ;loop if --shpost != 0
dcm01:  mov     [rbp+2*sw],rcx          ;[rbp+2*sw] = shpost
        cmp     rbx,8*sw                ;br if N+1 bit multiplier
        je      short dcm02
        xor     rdx,rdx
        mov     rdi,1                   ;rax = m = mhi + 2^e
        mov     rcx,rbx
        shl     rdi,cl
        or      rax,rdi
        jmp     short dct00

dcm02:  mov     rdx,1                   ;rdx = 2^N
        dec     qword ptr [rbp+2*sw]    ;dec rspost
        jmp     short dct00

dcm03:  mov     rcx,[rbp+1*sw]          ;rcx = rsc
        jmp     short dct10

;       test    rbx = n, rdx = m bit N, rax = m%(2^N)
;               [rbp+1*sw] = rspre, [rbp+2*sw] = rspost

dct00:  mov     rdi,rdx                 ;rdi:rsi = m
        mov     rsi,rax
        mov     rbx,0fffffffff0000000h  ;[rbp+5*sw] = rbx = n
dct01:  mov     [rbp+5*sw],rbx
        mov     rdx,rdi                 ;rdx:rax = m
        mov     rax,rsi
        mov     rcx,[rbp+1*sw]          ;rbx = n>>rspre
        shr     rbx,cl
        or      rdx,rdx                 ;br if 65 bit m
        jnz     short dct02
        mul     rbx                     ;rdx = (n*m)>>N
        jmp     short dct03

dct02:  mul     rbx
        sub     rbx,rdx
        shr     rbx,1
        add     rdx,rbx
dct03:  mov     rcx,[rbp+2*sw]          ;rcx = rspost
        shr     rdx,cl                  ;rdx = q = quotient
        mov     [rbp+4*sw],rdx          ;[rbp+4*sw] = q
        xor     rdx,rdx                 ;rdx:rax = n
        mov     rax,[rbp+5*sw]
        mov     rbx,[rbp+0*sw]          ;rbx = d
        div     rbx                     ;rax = n/d
        mov     rdx,[rbp+4*sw]          ;br if ok
        cmp     rax,rdx                 ;br if ok
        je      short dct04
        nop                             ;debug check
dct04:  mov     rbx,[rbp+5*sw]
        inc     rbx
        jnz     short dct01
        jmp     short main1

;       test    rbx = n, rcx = rsc
dct10:  mov     rbx,0fffffffff0000000h  ;rbx = n
dct11:  mov     rsi,rbx                 ;rsi = n
        shr     rsi,cl                  ;rsi = n>>rsc
        xor     edx,edx
        mov     rax,rbx
        mov     rdi,[rbp+0*sw]          ;rdi = d
        div     rdi
        cmp     rax,rsi                 ;br if ok
        je      short dct12
        nop
dct12:  inc     rbx
        jnz     short dct11

main1:  mov     rsi,[rbp+6*sw]          ;rsi ptr to divisor
        mov     rbx,[rsi]               ;rbx = divisor = d
        add     rsi,1*sw                ;advance ptr
        mov     [rbp+6*sw],rsi
        or      rbx,rbx
        jnz     main0                   ;br if not end table

        add     rsp,64                  ;restore regs
        pop     rsi
        pop     rdi
        pop     rbp
        xor     rax,rax
        ret     0

main    ENDP
        END
rcgldr
  • 27,407
  • 3
  • 36
  • 61
  • That's a pretty big block of code. It would be better if you split out a function or block that performs a division using the pre-generated constants. I guess there could be multiple strategies: do everything, or branch to run only the simple version for divisors that don't need a lot of extra work. – Peter Cordes Jul 29 '17 at 07:45
  • 1
    @PeterCordes - after reviewing the pdf file linked to in geza's answer, there's an alternate approach for dealing with divisors like 7. The "standard" approach uses multiply, subtract, shift (1), add, shift, while the alternate approach uses multiply, add, adc, shift or add, sbb, multiply, shift. For most divisors, the "standard" approach is multiply, shift. Since the divisor is not known at compile time, which approach to use has to be determined at run time. Near duplicate code could be used to avoid conditional branching or using a function to pointer (to call one of two routines). – rcgldr Jul 29 '17 at 17:56
  • Hmm, an indirect jump is interesting, but a function-pointer has the downside of forcing a spill of all non-call-preserved registers, unless the compiler knows that the pointer will always be one of two functions... Doesn't gcc support indirect `goto` syntax? You could use that to get an unconditional indirect `jmp` within an inlined function. An unconditional indirect jump is barely worse than a conditional branch, as far as branch-prediction goes. (And might actually be worse in some cases, like with cold predictors.) But mostly from being harder to write in C that will optimize well. – Peter Cordes Jul 29 '17 at 18:08
  • I'm picturing that the OP wants something that could inline into multiple call-sites. It's simpler if the divisor is only used in one place. – Peter Cordes Jul 29 '17 at 18:09
  • There is an intrinsic for `adc`, but it doesn't compile to good asm with gcc last time I tried. It tends to `setc` the flag result into an integer register, then use a `cmp` or something to get it back into CF before the next `adc`. – Peter Cordes Jul 30 '17 at 00:58
  • 1
    @PeterCordes - I looked at libdivide, and it uses multiple algorithms. It has a "branchfree" method that is the same as the N+1 method in my answer, and not the multiply, add, shift in geza's answer. I'll have to try that one. I deleted most of my prior comments. – rcgldr Aug 01 '17 at 00:53
  • @PeterCordes - I updated my answer. geza's answer appears to be the best common solution method, eliminating the need for conditionals. – rcgldr Aug 02 '17 at 02:41
  • Cool, that's some useful info about what's good at compile-time vs. run-time. – Peter Cordes Aug 02 '17 at 02:44