12

How to compute the integer division, 264/n? Assuming:

  • unsigned long is 64-bit
  • We use a 64-bit CPU
  • 1 < n < 264

If we do 18446744073709551616ul / n, we get warning: integer constant is too large for its type at compile time. This is because we cannot express 264 in a 64-bit CPU. Another way is the following:

#define IS_POWER_OF_TWO(x) ((x & (x - 1)) == 0)

unsigned long q = 18446744073709551615ul / n;
if (IS_POWER_OF_TWO(n))
    return q + 1;
else
    return q;

Is there any faster (CPU cycle) or cleaner (coding) implementation?

phuclv
  • 37,963
  • 15
  • 156
  • 475
Wu Yongzheng
  • 1,707
  • 17
  • 23

4 Answers4

14

I'll use uint64_t here (which needs the <stdint.h> include) so as not to require your assumption about the size of unsigned long.

phuclv's idea of using -n is clever, but can be made much simpler. As unsigned 64-bit integers, we have -n = 264-n, then (-n)/n = 264/n - 1, and we can simply add back the 1.

uint64_t divide_two_to_the_64(uint64_t n) {
  return (-n)/n + 1;
}

The generated code is just what you would expect (gcc 8.3 on x86-64 via godbolt):

    mov     rax, rdi
    xor     edx, edx
    neg     rax
    div     rdi
    add     rax, 1
    ret
Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
Nate Eldredge
  • 48,811
  • 6
  • 54
  • 82
  • 1
    Quite an astute observation. It seems to hold with an `if (twoPow64divphuclv (v) != twoPow64divnate (v)) { /* throw error */ }` for as many numbers as I was willing to wait through `:)` Though I did make yours `uint64_t twoPow64divnate (uint64_t n)` – David C. Rankin Apr 09 '19 at 04:14
  • @DavidC.Rankin if you can do 1 billion iterations a second, it'll take you ~585 years to loop through all possible `int64_t` values, so unfortunately you have to test a 32-bit version and then check if it work for some 64-bit values – phuclv Apr 09 '19 at 09:55
  • nice but wouldn't it be better to use standard `uint64_t` instead of `unsigned long` ? unsigned long isn't guaranteed to be 64 bits, even on a 64 bits compiler – Jean-François Fabre Sep 14 '20 at 19:27
  • 1
    @Jean-FrançoisFabre: Sure, if you like. The question explicitly stated their assumption that `unsigned long` is 64 bits on the system of interest, so I was just staying consistent with that. But I've changed it now. – Nate Eldredge Sep 14 '20 at 19:29
7

I've come up with another solution which was inspired by this question. From there we know that

(a1 + a2 + a3 + ... + an)/n =

(a1/n + a2/n + a3/n + ... + an/n) + (a1 % n + a2 % n + a3 % n + ... + an % n)/n

By choosing a1 = a2 = a3 = ... = an-1 = 1 and an = 264 - n we'll have

(a1 + a2 + a3 + ... + an)/n = (1 + 1 + 1 + ... + (264 - n))/n = 264/n

= [(n - 1)*1/n + (264 - n)/n] + [(n - 1)*0 + (264 - n) % n]/n

= (264 - n)/n + ((264 - n) % n)/n

264 - n is the 2's complement of n, which is -n, or we can also write it as ~0 - n + 1. So the final solution would be

uint64_t twoPow64div(uint64_t n)
{
    return (-n)/n + (n + (-n) % n)/n + (n > 1ULL << 63);
}

The last part is to correct the result, because we deal with unsigned integers instead of signed ones like in the other question. Checked both 32 and 64-bit versions on my PC and the result matches with your solution

On MSVC however there's an intrinsic for 128-bit division, so you can use like this

uint64_t remainder;
return _udiv128(1, 0, n, &remainder);

which results in the cleanest output

    mov     edx, 1
    xor     eax, eax
    div     rcx
    ret     0

Here's the demo

On most x86 compilers (one notable exception is MSVC) long double also has 64 bits of precision, so you can use either of these

(uint64_t)(powl(2, 64)/n)
(uint64_t)(((long double)~0ULL)/n)
(uint64_t)(18446744073709551616.0L/n)

although probably the performance would be worse. This can also be applied to any implementations where long double has more than 63 bits of significand, like PowerPC with its double-double implementation

There's a related question about calculating ((UINT_MAX + 1)/x)*x - 1: Integer arithmetic: Add 1 to UINT_MAX and divide by n without overflow with also clever solutions. Based on that we have

264/n = (264 - n + n)/n = (264 - n)/n + 1 = (-n)/n + 1

which is essentially just another way to get Nate Eldredge's answer

Here's some demo for other compilers on godbolt

See also:

phuclv
  • 37,963
  • 15
  • 156
  • 475
  • 1
    Since `(-n) /n` equals (2^64-n)/n = 2^64/n - 1, can't we just say `(-n)/n + 1`? It seems to give the right answer in all of my tests. Am I missing something? – Nate Eldredge Apr 09 '19 at 03:37
  • @NateEldredge good point. I didn't think of that but it seems reasonable – phuclv Apr 09 '19 at 09:46
2

We use a 64-bit CPU

Which 64-bit CPU?

In general, if you multiply a number with N bits by another number that has M bits, the result will have up to N+M bits. For integer division it's similar - if a number with N bits is divided by a number with M bits the result will have N-M+1 bits.

Because multiplication is naturally "widening" (the result has more digits than either of the source numbers) and integer division is naturally "narrowing" (the result has less digits); some CPUs support "widening multiplication" and "narrowing division".

In other words, some 64-bit CPUs support dividing a 128-bit number by a 64-bit number to get a 64-bit result. For example, on 80x86 it's a single DIV instruction.

Unfortunately, C doesn't support "widening multiplication" or "narrowing division". It only supports "result is same size as source operands".

Ironically (for unsigned 64-bit divisors on 64-bit 80x86) there is no other choice and the compiler must use the DIV instruction that will divide a 128-bit number by a 64-bit number. This means that the C language forces you to use a 64-bit numerator, then the code generated by the compiler extends your 64 bit numerator to 128 bits and divides it by a 64 bit number to get a 64 bit result; and then you write extra code to work around the fact that the language prevented you from using a 128-bit numerator to begin with.

Hopefully you can see how this situation might be considered "less than ideal".

What I'd want is a way to trick the compiler into supporting "narrowing division". For example, maybe by abusing casts and hoping that the optimiser is smart enough, like this:

  __uint128_t numerator = (__uint128_t)1 << 64;
  if(n > 1) {
      return (uint64_t)(numerator/n);
  }

I tested this for the latest versions of GCC, CLANG and ICC (using https://godbolt.org/ ) and found that (for 64-bit 80x86) none of the compilers are smart enough to realise that a single DIV instruction is all that is needed (they all generated code that does a call __udivti3, which is an expensive function to get a 128 bit result). The compilers will only use DIV when the (128-bit) numerator is 64 bits (and it will be preceded by an XOR RDX,RDX to set the highest half of the 128-bit numerator to zeros).

In other words, it's likely that the only way to get ideal code (the DIV instruction by itself on 64-bit 80x86) is to resort to inline assembly.

For example, the best code you'll get without inline assembly (from Nate Eldredge's answer) will be:

    mov     rax, rdi
    xor     edx, edx
    neg     rax
    div     rdi
    add     rax, 1
    ret

...and the best code that's possible is:

    mov     edx, 1
    xor     rax, rax
    div     rdi
    ret
Brendan
  • 35,656
  • 2
  • 39
  • 66
  • A minor disadvantage (or maybe advantage?) of the "best possible" code is that if you should accidentally try it with `n=1`, you will get a divide error exception. The other suggestions all return 0. – Nate Eldredge Apr 09 '19 at 04:24
  • @NateEldredge: Yes - that's why I needed to put the `if(n > 1)` when testing in godbolt (to make sure it's possible for compiler to know `n` can't be 1). – Brendan Apr 09 '19 at 04:34
  • Current compilers are also not smart enough to convert a division of a `__(u)int128_t` by a constant into a multiplication and calls the division function instead. And instead of `xor rax, rax` you should use `xor eax, eax` – phuclv Apr 09 '19 at 09:50
1

Your way is pretty good. It might be better to write it like this:

return 18446744073709551615ul / n + ((n&(n-1)) ? 0:1);

The hope is to make sure the compiler notices that it can do a conditional move instead of a branch.

Compile and disassemble.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • Interestingly, gcc 7 and 8 both compile your code with a conditional set, whose result is added. For the original code, gcc 7 does a conditional move. gcc 8 does a very clever `cmp rdi, 1 ; adc rax, 0` (!). – Nate Eldredge Apr 09 '19 at 04:33
  • Wow, that is very clever. I would be happy with any of these results, but your `-n/n+1` is much better, of course -- I'm disappointed that I didn't think of it. – Matt Timmermans Apr 09 '19 at 04:52