What algorithm should I use for high-performance large integer division?

Question

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply); as well as division by a single digit. But I would like match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).

I gather there are linear time algorithms for taking various notions of multiplicative inverse of my dividend. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.

My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my divisor, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?

Or is there a dedicated division algorithm that is even better than the inverse-based approach?

Edit: I dug up where I was getting "inverse" approach mentioned above. On page 312 of "Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R" which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and unclear how much overhead memory, etc, will be consumed until I code this up, which would take a while. I'll post it if no one beats me to it.

Do you know the asymptotic complexity of those methods offhand? In terms of the digit count passed into the function? To compare to the O(n^2) of tabletop multiplication, etc. — VoidStar, Oct 05 '15 at 07:17
`O(n*log(n))` sounds too fast, that's faster than the fastest multiplication. I suspect it turns out to be a bit slower for some reason, but I'll get back to you if I can figure out why. — VoidStar, Oct 05 '15 at 07:48
moved comments to answer, added binary long division example with some info ... — Spektre, Oct 05 '15 at 09:40

score 4 · Answer 1 · 2015-10-06T14:42:27.477

4

The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.

Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.

When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.

The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.

In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that

x = qy + r

but without the restriction that 0 <= r < y. The typical loop is

Estimate the quotient q of x/y
Compute the corresponding reduction r = x - qy
Optionally adjust the quotient so that the reduction r is in some desired interval
If r is too big, then repeat with r in place of x.

The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.

Schoolbook long division, for example, is of this form. e.g. step 3 covers those cases where the digit you guessed was too big or too small, and you adjust it to get the right value.

The divide and conquer approach estimates the quotient of x/y by computing x'/y' where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get best results if x' is twice as many digits of y'.

The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is

Estimate the inverse of y with m = floor(2^k / y)
Estimate x/y with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)

In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.

The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.

The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x per iteration.

edited Oct 06 '15 at 14:42

answered Oct 05 '15 at 07:33

I wonder if they rejected Knuth's suggestion, or just didn't know about it... It's going to take a while for me to decide. – VoidStar Oct 05 '15 at 08:00
@VoidStar You should try to write to the authors of the library and ask; they might be willing to discuss this if you are lucky. – Oct 05 '15 at 08:18
Thanks, I sent them an email on gmp-discuss. – VoidStar Oct 05 '15 at 09:38
@VoidStar: Although I don't have Knuth handy, I believe algorithm R is just Newton's algorithm for computing the inverse, it's the method you want to use to do the 'pre'computation step of Barrett reduction. – Oct 05 '15 at 10:06
@Hurkyl: So Barrett reduction is just a way of utilizing the inverse? Why not simply multiply by it? If you have a real inverse that you can multiply by to get the answer, I don't see what the point of Barrett reduction is. Although I'm unclear on Barrett reduction in this context anyway, the definition of that suggests it is for modular arithmetic (I'm doing plain integer division without modulus). – VoidStar Oct 06 '15 at 06:40
@VoidStar: Yeah. The main thing is that I find these sorts of things much easier to work with and analyze if you stick to integers everywhere (e.g. computing `floor(2^k/d)` rather than `1/d` to some amount of precision). Also, even if all you want is the quotient, you still need to compute the reduction anyways, because accumulated error means the quotient you compute can be off by 1 or 2 of the true quotient. (also, the intermediate step of long division requires you to compute the reduction anyways) – Oct 06 '15 at 07:30
Thanks, that makes sense. Although in the case of the Knuth algorithm, his algorithm computes the the inverse to such a precision that it is accurate enough to multiply by to get the quotient... he provides a proof that this is the case. The main question is regarding the barrett reduction time complexity... Can it compare with the `O(n logn loglogn)` that you get out of Knuth's division via inverse w/ schonhage-strassen? Or does it shine with smaller inputs? – VoidStar Oct 06 '15 at 07:59
Yes; you need a lot of precision to do it that way. To compute a `2n x n` quotient, I think you need `2n` bits of precision on the inverse, and then have to compute a full `2n x 2n` multiplication in order to get the quotient. This is more expensive than computing `n` bits of precision on the inverse, `2n x n` multiplication to estimate the quotient, then an `n x n` product to compute the reduction. That works out to something like `6 M(n)` versus `4.5 M(n)`. – Oct 06 '15 at 08:26
If you compute `n/2` bits of precision on the inverse and do two steps of reduction (i.e. long division), that requires two `n x n/2` products to estimate quotients and two `n x n/2` products to compute the reductions, which works out to `4 M(n)`. (none of these costings take into account tricks to reduce the cost when you know part of part of the product already) (these numbers are assuming we're in the FFT range for doing multiplications, so that `M(2n)` is about the same cost as `2M(n)`, and that you can do an `2nxn` multiply with cost `M(1.5 n)`) – Oct 06 '15 at 08:30
Hmm, as I look at this reciprocal more, you'd need to be in the range of like 100 times normal "FFT range" size (a.k.a. ludicrously huge) before the complexity gains would really kick in (it involves lots of repeated multiplication). I'm still struggling to figure out the exact time complexity of Barrett here (I assume its between `nlogn` and `n^2`), but I bet it does work better in practice. Now I just need to figure out the C code for it. – VoidStar Oct 06 '15 at 09:08
@VoidStar: You have to carefully do each step with only just the precision you need rather than full precision -- each step of the inverse calculation is done with twice the precision of the previous step. If done right, the entire calculation should only cost about twice as much as the last step. – Oct 06 '15 at 09:32
@Hurkyl: You're right, that's exactly what Knuth's writing depicts, it's just that its still many times slower than multiplication overall, including mutliplying at the end to get the remainder. Also... unfortunately the Barret reduction algorithm [here](https://en.wikipedia.org/wiki/Barrett_reduction) carries the assumption that when dividing `a/n`, `a` has to be less than `n^2`. That's a complication... – VoidStar Oct 06 '15 at 09:40

score 3 · Answer 2 · edited May 23 '17 at 12:33

I do not know the multiplicative inverse algorithm but it sounds like modification of Montgomery Reduction or Barrett's Reduction.

I do bigint divisions a bit differently.

See bignum division. Especially take a look at the approximation divider and the 2 links there. One is my fixed point divider and the others are fast multiplication algos (like karatsuba,Schönhage-Strassen on NTT) with measurements, and a link to my very fast NTT implementation for 32bit Base.

I'm not sure if the inverse multiplicant is the way.

It is mostly used for modulo operation where the divider is constant. I'm afraid that for arbitrary divisions the time and operations needed to acquire bigint inverse can be bigger then the standard divisions itself, but as I am not familiar with it I could be wrong.

The most common divider in use I saw in implemetations are Newton–Raphson division which is very similar to approximation divider in the link above.

Approximation/iterative dividers usually use multiplication which define their speed.

For small enough numbers is usually long binary division and 32/64bit digit base division fast enough if not fastest: usually they have small overhead, and let n be the max value processed (not the number of digits!)

Binary division example:

Is O(log32(n).log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, sub, add, bitshift. Each of those operations can be done in log32(n), and log2(n) is the number of bits.

Here example of binary division from one of my bigint templates (C++):

template <DWORD N> void uint<N>::div(uint &c,uint &d,uint a,uint b)
    {
    int i,j,sh;
    sh=0; c=DWORD(0); d=1;
    sh=a.bits()-b.bits();
    if (sh<0) sh=0; else { b<<=sh; d<<=sh; }
    for (;;)
        {
        j=geq(a,b);
        if (j)
            {
            c+=d;
            sub(a,a,b);
            if (j==2) break;
            }
        if (!sh) break;
        b>>=1; d>>=1; sh--;
        }
    d=a;
    }

N is the number of 32 bit DWORDs used to store a bigint number.

c = a / b
d = a % b
qeq(a,b) is a comparison: a >= b greater or equal (done in log32(n)=N)
It returns 0 for a < b, 1 for a > b, 2 for a == b
sub(c,a,b) is c = a - b

The speed boost is gained from that this does not use multiplication (if you do not count the bit shift)

If you use digit with a big base like 2^32 (ALU blocks), then you can rewrite the whole in polynomial like style using 32bit build in ALU operations.
This is usually even faster then binary long division, the idea is to process each DWORD as a single digit, or recursively divide the used arithmetic by half until hit the CPU capabilities.
See division by half-bitwidth arithmetics

On top of all that while computing with bignums

If you have optimized basic operations, then the complexity can lower even further as sub-results get smaller with iterations (changing the complexity of basic operations) A nice example of that are NTT based multiplications.

The overhead can mess thing up.

Due to this the runtime sometimes does not copy the big O complexity, so you should always measure the tresholds and use faster approach for used bit-count to get the max performance and optimize what you can.

In Big O notation, you should always strip scalar constants. `O(log32(n))` = `O(log(N))` because they are irrelevant to describing the rate of growth. Secondly, Big O is most useful and most commonly phrased in terms of the number of bits in the input. So digit count is what you should base this on instead of the size of the value that can be processed. What you've shown is an `O(n^2)` algorithm, which is passable, but with Knuth's high-speed reciprocal combined with a fast multiplication, it's possible to be faster (with ridiculously large inputs. yours is great for medium-sized stuff). — VoidStar, Oct 05 '15 at 09:48
@VoidStar in tat case the result is in `O(n^2)` for binary long division — Spektre, Oct 05 '15 at 09:50
@VoidStar Out of curiosity, what do you mean by "ridicolously large" and "medium-sized"? How many digits? — Fabio says Reinstate Monica, Oct 05 '15 at 10:09
@FabioTurati that depends on the implementation ... for example see [fast bignum sqr](http://stackoverflow.com/q/18465326/2521214) the NTT based `sqr` treshold of mine impementation is `310*32=9920` bit of operand (19840 bit of result) and NTT `mul` have 1396*32=44672 bit of result which are really huge numbers ... When you change implementation (optimize or whatever the tresholds can change the same goes for changing platform of computation) — Spektre, Oct 05 '15 at 11:40

What algorithm should I use for high-performance large integer division?

2 Answers2

Linked