
For my BigInteger code, output turned out to be slow for very large BigIntegers. So now I use a recursive divide-and-conquer algorithm, which still needs 2'30" to convert the currently largest known prime to a decimal string of more than 22 million digits (but only 135 ms to turn it into a hexadecimal string).

I still want to reduce the time, so I need a routine that can divide a NativeUInt (i.e. UInt32 on 32 bit platforms, UInt64 on 64 bit platforms) by 100 very quickly, so I use multiplication by a constant. This works fine in 32 bit code, but I am not 100% sure it is reliable for 64 bit.

So my question: is there a way to check the reliability of the results of multiplication by constant for unsigned 64 bit values? I checked the 32 bit values by simply trying with all values of UInt32 (0..$FFFFFFFF). This took approx. 3 minutes. Checking all UInt64s would take much longer than my lifetime. Is there a way to check if the parameters used (constant, post-shift) are reliable?

I noticed that DivMod100() always failed for a value like $4000004B if the chosen parameters were wrong (but close). Are there special values or ranges to check for 64 bit, so I don't have to check all values?

My current code:

const
{$IF DEFINED(WIN32)}
  // Checked
  Div100Const = UInt32(UInt64($1FFFFFFFFF) div 100 + 1);
  Div100PostShift = 5;
{$ELSEIF DEFINED(WIN64)}
  // Unchecked!!
  Div100Const = $A3D70A3D70A3D71; 
  // UInt64(UInt128($3 FFFF FFFF FFFF FFFF) div 100 + 1); 
  // UInt128 is fictive type.
  Div100PostShift = 2;
{$IFEND}

// Calculates X div 100 using multiplication by a constant, taking the
// high part of the 64 bit (or 128 bit) result and shifting
// right. The remainder is calculated as X - quotient * 100;
// This was tested to work safely and quickly for all values of UInt32.
function DivMod100(var X: NativeUInt): NativeUInt;
{$IFDEF WIN32}
asm
        // EAX = address of X, X is UInt32 here.
        PUSH    EBX
        MOV     EDX,Div100Const
        MOV     ECX,EAX
        MOV     EAX,[ECX]
        MOV     EBX,EAX
        MUL     EDX
        SHR     EDX,Div100PostShift
        MOV     [ECX],EDX       // Quotient

        // Slightly faster than MUL

        LEA     EDX,[EDX + 4*EDX] // EDX := EDX * 5;
        LEA     EDX,[EDX + 4*EDX] // EDX := EDX * 5;
        SHL     EDX,2             // EDX := EDX * 4; 5*5*4 = 100.

        MOV     EAX,EBX
        SUB     EAX,EDX         // Remainder
        POP     EBX
end;
{$ELSE WIN64}
asm
        .NOFRAME

        // RCX is address of X, X is UInt64 here.
        MOV     RAX,[RCX]
        MOV     R8,RAX
        XOR     RDX,RDX
        MOV     R9,Div100Const
        MUL     R9
        SHR     RDX,Div100PostShift
        MOV     [RCX],RDX      // Quotient

        // Faster than LEA and SHL

        MOV     RAX,RDX
        MOV     R9D,100
        MUL     R9
        SUB     R8,RAX
        MOV     RAX,R8         // Remainder
end;
{$ENDIF WIN32}
Rudy Velthuis
  • This is close to a dupe of http://stackoverflow.com/questions/20270596/ but in any case you'll find the answer there by reading libdivide – David Heffernan Jan 30 '16 at 11:14
  • I used libdivide to generate a constant, but it amounts to `$1C0000000000000000 div 100 + 1` with a post-shift of 6, but the result is not `n div 100` in the high part. libdivide gives the results I expect for 32 bit, but perhaps I don't understand the way it is used for 64 bit. I'll experiment a little more. – Rudy Velthuis Jan 30 '16 at 14:24
  • Please provide an answer. – David Heffernan Jan 30 '16 at 15:04
  • @DavidHeffernan: Ok, I found how to do it properly, using libdivide.h. Apparently there is a shift/add step required. Works fine now. Should I post the solution as an answer, or just edit the question? – Rudy Velthuis Jan 30 '16 at 15:23
  • OK, will post the answer. – Rudy Velthuis Jan 30 '16 at 15:24

3 Answers


As usual when writing optimized code, use compiler output for hints and starting points. It's safe to assume any optimization the compiler makes is valid in the general case; wrong-code compiler bugs are rare.

gcc implements unsigned 64bit divmod with a constant of 0x28f5c28f5c28f5c3. I haven't looked in detail into generating constants for division, but there are algorithms for generating them that will give known-good results (so exhaustive testing isn't needed).

The compiler's code actually has an important difference: it uses its constant differently from how the OP's constant is used.

See the comments for an analysis of what this is actually doing: divide by 4 first, so it can use a constant that only works for dividing by 25 when the dividend is small enough. This also avoids needing an add later on.

#include <stdint.h>

// rem, quot ordering takes one extra instruction
struct divmod { uint64_t remainder, quotient; }
 div_by_100(uint64_t x) {
    struct divmod retval = { x%100, x/100 };
    return retval;
}

compiles to (gcc 5.3 -O3 -mtune=haswell):

    movabs  rdx, 2951479051793528259
    mov     rax, rdi            ; Function arg starts in RDI (SysV ABI)
    shr     rax, 2
    mul     rdx
    shr     rdx, 2
    lea     rax, [rdx+rdx*4]    ; multiply by 5
    lea     rax, [rax+rax*4]    ; multiply by another 5
    sal     rax, 2              ; imul rax, rdx, 100 is better here (Intel SnB).
    sub     rdi, rax
    mov     rax, rdi
    ret
; return values in rdx:rax

Use godbolt's "binary" option to see the constant in hex, since disassembler output shows it that way, unlike gcc's asm source output.


The multiply-by-100 part.

gcc uses the above sequence of lea/lea/shl, the same as in your question. Your answer is using a mov imm/mul sequence.

The comments in each version claim that the chosen sequence is faster. If so, it's because of some subtle instruction alignment or another secondary effect: on Intel SnB-family, both are the same number of uops (3) with the same critical-path latency (mov imm is off the critical path, and mul is 3 cycles).

clang uses what I think is the best bet (imul rax, rdx, 100). I thought of it before I saw that clang chose it, not that that matters. That's 1 fused-domain uop (which can execute on p0 only), still with 3c latency. So if you're latency-bound using this routine for multi-precision, it probably won't help, but it is the best choice. (If you're latency-bound, inlining your code into a loop instead of passing one of the parameters through memory could save a lot of cycles.)

imul works because you're only using the low 64b of the result. There's no 2 or 3 operand form of mul because the low half of the result is the same regardless of signed or unsigned interpretation of the inputs.

BTW, clang with -march=native uses mulx for the 64x64->128, instead of mul, but doesn't gain anything by it. According to Agner Fog's tables, it's one cycle worse latency than mul.


AMD has worse than 3c latency for imul r,r,i (esp. the 64b version), which is maybe why gcc avoids it. IDK how much work gcc maintainers put into tweaking costs so settings like -mtune=haswell work well, but a lot of code isn't compiled with any -mtune setting (even one implied by -march), so I'm not surprised when gcc makes choices that were optimal for older CPUs, or for AMD.

clang still uses imul r64, r64, imm with -mtune=bdver1 (Bulldozer), which saves m-ops but at a cost of 1c latency more than using lea/lea/shl. (lea with a scale>1 is 2c latency on Bulldozer).

Peter Cordes
  • @user246408: If I'm following the comments on your answer correctly, clang and gcc probably do it this way so they don't need a 65bit add-and-shift, right? This does look cheaper than Rudy's code. – Peter Cordes Feb 01 '16 at 06:59
  • Sorry, I've deleted my comments, because I did not look enough into the gcc code. Yes, it uses only a shift, not add-and-shift; the constant `0x28f5c28f5c28f5c3` consists of the most significant bits of the reciprocal 1/25 (or 1/100, they are the same), bit-shifted right by 2 bits. Given that the gcc code is correct, it uses an optimization based on the fact that the dividend, after pre-shifting right by 2 bits, is less than 0x4000000000000000. Summing up: though a general 64-bit constant for division by 25 does not exist, one exists for dividends less than 0x4000000000000000. – kludg Feb 01 '16 at 10:02
  • @user246408: thanks for that analysis. I didn't take the time to grok the whole picture, I mostly just wanted to contribute a "look what the compiler does" answer, without taking the time to understand exactly why the compiler was able to do that. – Peter Cordes Feb 01 '16 at 10:05
  • Would be nice if you added the gcc code for `div_by_25`. If my analysis is correct, it should differ from `div_by_100`, because the add-and-shift thing is needed. – kludg Feb 01 '16 at 10:48
  • @user246408: godbolt makes it super-easy for anyone to go and try code changes and see the same asm output that I copy/pasted. That's why I put godbolt links in all my asm-output answers. Anyway, here's a link with [div_by_100 and div_by_25](http://goo.gl/z3Bn05). You're right, it shifts by one and adds. The constant it uses is `0x47ae147ae147ae15`, to save the trouble of flipping godbolt to "binary" mode to get hex. Interestingly, ICC13 uses that constant for both `div_by_100` and `div_by_25`. Instead of multiplying by 25 or 100, it multiplies by `-25` or `-100`. – Peter Cordes Feb 01 '16 at 11:19

I found the solution with libdivide.h. Here is the slightly more complicated part for Win64:

{$ELSE WIN64}
asm
        .NOFRAME

        MOV     RAX,[RCX]
        MOV     R8,RAX
        XOR     RDX,RDX
        MOV     R9,Div100Const       // New: $47AE147AE147AE15
        MUL     R9                   // Preliminary result Q in RDX

        // Additional part: add/shift

        ADD     RDX,R8               // Q := Q + X shr 1;
        RCR     RDX,1

        SHR     RDX,Div100PostShift  // Q := Q shr 6;
        MOV     [RCX],RDX            // X := Q;

        // Faster than LEA and SHL

        MOV     RAX,RDX
        MOV     R9D,100
        MUL     R9
        SUB     R8,RAX
        MOV     RAX,R8         // Remainder
end;
{$ENDIF WIN32}
Rudy Velthuis
  • Why not `(Q + X) shr 1` instead of `Q + (X - Q) shr 1`? – kludg Jan 30 '16 at 16:31
  • Hmmm... I took this from libdivide.h. I guess the intermediate result might overflow, for some values. I can try, though. If necessary, I can try to use `RCR` instead. Thanks for the hint. – Rudy Velthuis Jan 30 '16 at 17:25
  • @user246408: You are right. I used RCR to shift back in the carry in case of an overflow, and it is slightly simpler and faster now. I edited the answer. – Rudy Velthuis Jan 30 '16 at 17:55
  • Interesting trick with `rcr` to get an extra bit of intermediate precision. It's a 3 uop instruction on Intel SnB-family, (replacing three 1uop insns) so the change doesn't save any uops there. It's only 1 m-op on AMD, though, so it saves two macro-ops there. Note that `rcr` by an immediate count other than 1 is significantly slower, so it's not useful even if you could combine its right-shift with the following `shr`. `rcr` is 2c latency (Intel), and the `mov` wasn't on the critical path (and is zero latency anyway on IvB and later), so again it's a wash. (sub and shl r,1 are 1c) – Peter Cordes Jan 31 '16 at 15:44
  • @PeterCordes: So `MOV R9,R8; SUB R9,RDX; SHR R9,1; ADD RDX,R9` is equivalent (qua timing, I mean -- I know it gives the same result) to the above while it avoids the "65th bit"? – Rudy Velthuis Jan 31 '16 at 19:30
  • @RudyVelthuis: on Intel SnB-family, yes same uop count and latency. Maybe different port requirements. (e.g. IvB and later don't need a port for the mov). On Pentium-M to Nehalem, it's 2 uops (still with 2c latency). On AMD (and PII/PIII), the `add` / `rcr 1` is faster. On Silvermont, `rcr 1` is 7uops (while simple instructions are still 1). I've seen the `Q + (X - Q)/2` idiom before, in C, for calculating an average while avoiding overflow/carry. Anyway, you've stumbled onto another one of those "faster on one CPU, slower on others" cases. – Peter Cordes Jan 31 '16 at 23:01
  • You don't need to zero `rdx` before `mul`. It's a write-only operand for `mul`, unlike `div`. And like I pointed out in my answer, `imul rax, rdx, 100` is better than `lea` / `lea` / `shl`. [clang even uses it](http://goo.gl/dDV267). Clang's version is 9 uops (Intel Haswell) to your 12 (not counting your loads and stores or the wasted xor). – Peter Cordes Feb 01 '16 at 06:30
  • ^Yes. There are 2 improvements possible: (1) instead of zeroing `RDX`, move `Div100Const` into `RDX` and `MUL RDX`; (2) if `RCR` is slow, then shift the dividend 1 (or 2) bits right and divide by 50 (or 25) - there would be no overflow, so no need for `RCR`. – kludg Feb 01 '16 at 07:11
  • @user and Peter: I keep forgetting that I don't have to zero out RDX before a mult, only before I start a division chain. – Rudy Velthuis Feb 01 '16 at 10:45

The code in @Rudy's answer results from the following steps:

  1. Write 1/100 in binary form: 0.000000(10100011110101110000);
  2. Count leading zeroes after the binary point: S = 6;
  3. The first 72 significant bits are:

1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101

  4. Round to 65 bits; there is some kind of magic in how this rounding is performed; by reverse-engineering the constant from Rudy's answer, the correct rounding is:

1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 1

  5. Remove the leading 1 bit:

0100 0111 1010 1110 0001 0100 0111 1010 1110 0001 0100 0111 1010 1110 0001 0101

  6. Write in hexadecimal form (recovering the reverse-engineered constant):

A = 47 AE 14 7A E1 47 AE 15

  7. X div 100 = (((uint128(X) * uint128(A)) shr 64) + X) shr 7 (7 = 1 + S)

kludg
  • I actually did something similar for 32 bit, but simpler. I divided `$100000000` by `100` (I actually started with `div 25` and then `shr 2`), and tried this with all cardinals. When that did not succeed, I used `$200000000` and one shift extra, and repeated this until I found what I needed. – Rudy Velthuis Jan 30 '16 at 21:48
  • FWIW, the rounding used seems to be quite simple: round up (away from zero). But why is the leading one bit removed? I still think that `magic = $3FFFFFFFFFFFFFFFFF div 100 + 1` (you have that too: `$A3D70A3... etc.`) and a shift of 6 should work too. I just don't know how to prove or test it reliably. – Rudy Velthuis Jan 30 '16 at 21:53
  • FWIW, their constant is `$1C0000000000000000 div 100`, mine is `$400000000000000000 div 100`. $1C = 28. So that is (28/64) * (64/100) = 28/100. Add to that 100/100 (X) and you have 128/100th of X. Shift that right once, and you have 64/100th. Shift that right 6 times and you get 1/100th. ISTM that my constant (`$A3D70...`) should work without the add/shift and give the exact same result. Also note that 64bit x 64bit can never overflow into 129 bits or some such, so there is no need for trickery like (Q + (X - Q) shr 1). – Rudy Velthuis Jan 31 '16 at 00:15
  • @RudyVelthuis the leading 1 bit is removed because it is accounted for by adding `X` in the final formula; the trick of adding `X` means that we use a 65-bit constant instead of the 64-bit one you tried in your first attempt; a 64-bit constant is not enough to always produce the correct division result in the 64-bit case; I am also not sure that a 65-bit constant is enough in the 64-bit case. – kludg Jan 31 '16 at 04:25
  • Hmmm... a 32 bit constant is good enough to reliably get the correct result in 32 bit. Perhaps not for all values, but apparently for div 100. I understood that X is added in. – Rudy Velthuis Jan 31 '16 at 19:25
  • FWIW, instead of 100, I could use 25 for division and then shift right by 2 later on. That would avoid the 65th bit and would allow an even more accurate constant. I'll have to try that. – Rudy Velthuis Jan 31 '16 at 19:35
  • Hmmm... libdivide gives the same constant value and code and only a shift of 4 instead of 6 for division by 25. Nothing gained. – Rudy Velthuis Jan 31 '16 at 19:38
  • @RudyVelthuis - looks like 65 bits is always enough; actually an (N+1)-bit constant is always enough for N-bit division, while an N-bit constant is possible only for some divisors. – kludg Feb 01 '16 at 05:32
  • I'm currently reading this. That seems to be the base for the libdivide code: https://gmplib.org/~tege/divcnst-pldi94.pdf – Rudy Velthuis Feb 01 '16 at 15:38
  • @RudyVelthuis - Having read the article, some thoughts: https://sergworks.wordpress.com/2016/02/01/integer-division-by-constant/ – kludg Feb 01 '16 at 20:02