
I am writing a code library in x86-64 assembly-language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128, s0256, s0512, s1024 signed-integer types and f0128, f0256, f0512, f1024 floating-point types. So far I'm working on the signed-integer types, because the floating-point functions will likely call some internal routines written for the integer types.

So far I've written and tested functions to perform the various unary operators, compare operators, and the add, subtract and multiply operators.

Now I'm trying to decide how to implement functions for the divide operators.

My first thought was, "Newton-Raphson must be the best approach". Why? Because it converges very quickly given a good seed (starting guess), and I figure I can execute the native 64-bit divide instruction on the operands to get an excellent seed value. In fact, if the seed value is precise to 64 bits, getting the correct answer should only take:

`s0128` : 1~2 iterations : (or 1 iteration  plus 1~2 "test subtracts")
`s0256` : 2~3 iterations : (or 2 iterations plus 1~2 "test subtracts")
`s0512` : 3~4 iterations : (or 3 iterations plus 1~2 "test subtracts")
`s1024` : 4~5 iterations : (or 4 iterations plus 1~2 "test subtracts")
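Those counts follow from quadratic convergence: each iteration roughly doubles the number of correct bits, so a 64-bit-accurate seed needs ceil(log2(target_bits / 64)) iterations. A minimal sketch in C (the helper name `newton_steps` is ours, not part of the library):

```c
/* Iterations needed for Newton-Raphson starting from a 64-bit-accurate
   seed: each step doubles the number of correct bits. */
static int newton_steps(int target_bits)
{
    int steps = 0;
    for (int bits = 64; bits < target_bits; bits *= 2)
        steps++;               /* one more precision doubling */
    return steps;
}
```

This reproduces the low end of the table above: 1 step for s0128, 2 for s0256, 3 for s0512, 4 for s1024, plus the final "test subtracts".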

However, a bit more thinking about this question makes me wonder. For example, I recall the core routine I wrote that performs the multiply operation for all the large integer types:

s0128 :   4 iterations ==   4 (128-bit = 64-bit * 64-bit) multiplies +  12 adds
s0256 :  16 iterations ==  16 (128-bit = 64-bit * 64-bit) multiplies +  48 adds
s0512 :  64 iterations ==  64 (128-bit = 64-bit * 64-bit) multiplies + 192 adds
s1024 : 256 iterations == 256 (128-bit = 64-bit * 64-bit) multiplies + 768 adds

The growth in operations for the wider data-types is quite substantial, even though the loop is fairly short and efficient (including cache-wise). This loop writes each 64-bit portion of the result only once, and never reads back any portion of the result for further processing.
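For reference, here is a minimal C sketch of a schoolbook n-limb multiply with the n*n partial-product structure counted above (n = 2/4/8/16 gives the 4/16/64/256 multiplies for s0128..s1024). Note this simple row-wise version re-reads result limbs to accumulate, unlike the write-once loop described above; the function name and layout are ours:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Schoolbook multiply: n limbs * n limbs -> 2n limbs, built from
   n*n native 64x64 -> 128-bit partial products. */
static void mul_nxn(const uint64_t *a, const uint64_t *b,
                    uint64_t *r, int n)      /* r must hold 2n limbs */
{
    for (int i = 0; i < 2 * n; i++)
        r[i] = 0;
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            u128 p = (u128)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint64_t)p;          /* low 64 bits  */
            carry    = (uint64_t)(p >> 64);  /* high 64 bits */
        }
        r[i + n] = carry;                    /* final carry-out of row */
    }
}
```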

This got me thinking about whether more conventional shift-and-subtract type divide algorithms might be faster, especially for the larger types.

My first thought was this:

result = dividend / divisor                  // if I remember my terminology
remainder = dividend - (result * divisor)    // or something along these lines

#1: To compute each bit, generally the divisor is subtracted from the dividend IF the divisor is less than or equal to the dividend. Well, usually we can determine the divisor is definitely less-than or definitely greater-than the dividend by only inspecting their most-significant 64-bit portions. Only when those ms64-bit portions are equal must the routine check the next lower 64-bit portions, and only when they are equal must we check even lower, and so forth. Therefore, on almost every iteration (computing each bit of result), we can greatly reduce the instructions executed to compute this test.
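The early-out test in #1 can be sketched in C as a most-significant-limb-first scan (helper name ours); almost every call decides at the first limb and never touches the rest:

```c
#include <stdint.h>

/* Compare two n-limb unsigned numbers, most-significant limb first.
   Returns -1, 0, or +1 for a < b, a == b, a > b. */
static int cmp_n(const uint64_t *a, const uint64_t *b, int n)
{
    for (int i = n - 1; i >= 0; i--)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;   /* decided: stop early */
    return 0;                              /* all limbs equal */
}
```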

#2: However... on average, about 50% of the time we will find we need to subtract the divisor from the dividend, so we end up subtracting across their entire widths anyway. In that case we actually execute more instructions than the conventional approach (where we subtract first, then test the flags to determine whether divisor <= dividend). So half the time we realize a saving, and half the time we realize a loss. On large types like s1024 (which contains -16- 64-bit components), the savings are substantial and the losses small, so this approach should realize a large net saving. On tiny types like s0128 (which contains -2- 64-bit components), the savings are tiny and the losses significant, though not huge.
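The shift-and-subtract scheme being weighed in #1/#2 looks like this bit-at-a-time restoring division, sketched at s0128 width using the compiler's `unsigned __int128` for brevity (a real multi-limb version would use the limb-wise compare and subtract discussed above):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Restoring division: one quotient bit per iteration. Each step
   shifts in the next dividend bit, compares (the test from #1), and
   subtracts the divisor when it fits (~50% of the time, as in #2). */
static u128 div_restoring(u128 num, u128 den, u128 *rem)
{
    u128 q = 0, r = 0;
    for (int i = 127; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1);   /* bring in next bit */
        if (r >= den) {
            r -= den;
            q |= (u128)1 << i;
        }
    }
    *rem = r;
    return q;
}
```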


So, my question is, "what are the most efficient divide algorithms", given:

#1: modern x86-64 CPUs like FX-8350
#2: executing in 64-bit mode only (no 32-bit)
#3: implementation entirely in assembly-language
#4: 128-bit to 1024-bit integer operands (nominally signed, but...)

NOTE: My guess is, the actual implementation will operate only on unsigned integers. In the case of multiply, it turned out to be easier and more efficient (maybe) to convert negative operands to positive, then perform unsigned-multiply, then negate the result if exactly one original operand was negative. However, I'll consider a signed-integer algorithm if it is efficient.

NOTE: If the best answers are different for my floating-point types (f0128, f0256, f0512, f1024), please explain why.

NOTE: My internal core unsigned-multiply routine, which performs the multiply operation for all these integer data-types, produces a double-width result. In other words:

u0256 = u0128 * u0128     // cannot overflow
u0512 = u0256 * u0256     // cannot overflow
u1024 = u0512 * u0512     // cannot overflow
u2048 = u1024 * u1024     // cannot overflow

My code library offers two versions of multiply for each signed-integer data-type:

s0128 = s0128 * s0128     // can overflow (result not fit in s0128)
s0256 = s0256 * s0256     // can overflow (result not fit in s0256)
s0512 = s0512 * s0512     // can overflow (result not fit in s0512)
s1024 = s1024 * s1024     // can overflow (result not fit in s1024)

s0256 = s0128 * s0128     // cannot overflow
s0512 = s0256 * s0256     // cannot overflow
s1024 = s0512 * s0512     // cannot overflow
s2048 = s1024 * s1024     // cannot overflow

This is consistent with the policy of my code library to "never lose precision" and "never overflow" (errors are returned when the answer is invalid due to precision-loss or due to overflow/underflow). However, when double-width return value functions are called, no such errors can occur.

Peter Cordes
honestann
  • Yep. Divisions suck. That's why people tend to avoid them as much as possible... – Mysticial Dec 27 '13 at 09:18
  • Can't you use a divide-and-conquer algorithm and just omit the conquering? – Kerrek SB Dec 28 '13 at 00:04
  • I have implemented all the above in C++: 1. A class that supports all arithmetic operations for a natural number of any conceivable size. 2. A class that supports all arithmetic operations for a rational number of any conceivable size. The latter stores the rational number as a pair of natural numbers {numerator,denominator}, and performs no FP operations whatsoever (except within the class constructor which takes a 'double' and converts it to numerator and denominator). It is rather slow and inefficient, yet 100% accurate (no precision-loss). If it helps, I can send you the code... – barak manos Jan 01 '14 at 13:08
  • Wonderful offer. Maybe later. But more valuable to me (and other readers) is opinions and discoveries of techniques to implement the operations. Besides divide, sin(), cos(), tan(), cot(), sec(), csc(), asin(), acos(), atan(), sinh(), cosh(), tanh(), asinh(), acosh(), atanh(), ln(), log(), exp(), power(), print(), etc? I almost chose variable length operands, but decided on fixed sizes: the largest supports enough precision and range for ALMOST any conceivable purpose. My code contains no floating-point (but implements floating-point). Where do you get monster constants like e, pi, etc? – honestann Jan 01 '14 at 20:03
  • @Mysticial: "divisions suck"? Wow. In computing, one wants to work with numbers, formulas and their algebraic properties. "division" is an inverse operation of multiplication, the analog to subtraction being the inverse of addition. I can't imagine doing serious computing without subtraction; that's because inverses are really useful. Thus for division. Just because it is *inconvenient* to implement isn't a reason to avoid it; it's just a reason to implement it well. – Ira Baxter Jan 02 '14 at 05:34
  • @IraBaxter: I've had very good results with "arbitrary size rational numbers". For this a number is represented as `numerator/divisor * 2**exponent`, and you never need to divide (you multiply by reciprocal instead). Of course you probably will want to be able to simplify your rational numbers (find greatest common divisor and use it to reduce numerator and divisor), but that's optional and needn't be anywhere near a critical path. – Brendan Jan 16 '14 at 08:52
  • @Brendan: I interpret your answer as "division is still important; I'm just using reciprocals and GCD as an implementation". It is still inconvenient, but as you've noted, you do it once and put it into a library and then you don't have to think about it any more. [We've also built a rational package, but didn't see any value in the "exponent" part (your power of two). We also have an arbitrary precision integer package which offers, among other things, "divide".] – Ira Baxter Jan 16 '14 at 16:03

5 Answers


Surely you know about the existing arbitrary-precision packages (e.g., http://gmplib.org/) and how they operate? They are generally designed to run "as fast as possible" for arbitrary precisions.

If you specialized them for fixed sizes (e.g., applying [manually] partial-evaluation techniques to fold constants and unroll loops), I'd expect you to get pretty good routines for the specific fixed-size precisions you want.

Also, if you haven't seen it, check out D. Knuth's Seminumerical Algorithms, an oldie but a really goodie, which provides the key algorithms for multi-precision arithmetic. (Most of the packages are based on these ideas, but Knuth gives great explanations and got an awful lot right.)

The key idea is to treat multi-precision numbers as if they were very-big-radix numbers (e.g., radix 2^64), and apply standard 3rd-grade arithmetic to the "digits" (e.g., 64-bit words). Divide consists of "estimate a quotient (big-radix) digit, multiply the estimate by the divisor, subtract from the dividend, shift left one digit, repeat" until you get enough digits to satisfy you. For division, yes, it's all unsigned (with sign handling done in wrappers). The basic trick is estimating a quotient digit well (using the single-precision instructions the processor provides), and doing fast multi-precision multiplies by single digits. See Knuth for details. See technical research papers on multi-precision arithmetic (you get to do some research) for exotic ("fastest possible") improvements.
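The scheme is easiest to see in the special case of a single-limb divisor, where each estimate is exact. A C sketch (function name ours), with the grade-school steps carried out in radix 2^64 via the x86-64 128-by-64 divide that `unsigned __int128` division compiles to; with a multi-limb divisor the leading-limb estimate can be slightly off and Knuth's Algorithm D supplies the correction step:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Long division of an n-limb numerator by one 64-bit "digit": each
   step divides the running (remainder:next-limb) pair -- a 128-bit
   value -- by the 64-bit divisor. Returns the final remainder. */
static uint64_t div_n_by_1(const uint64_t *num, int n,
                           uint64_t d, uint64_t *quot)
{
    uint64_t rem = 0;
    for (int i = n - 1; i >= 0; i--) {
        u128 cur = ((u128)rem << 64) | num[i];  /* shift in next digit */
        quot[i] = (uint64_t)(cur / d);          /* one quotient digit  */
        rem     = (uint64_t)(cur % d);          /* carry down          */
    }
    return rem;
}
```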

Ira Baxter
  • I get the basic idea behind multi-word division but with multi-bit digits things seem to get subtle rather quickly. I remember being baffled by even the simplest two-word case in Hacker's Delight. Is the TAOCP coverage suitable reading material for those of us with a less-than-ideal CS background who'd still like to understand the method in depth? – doynax Jan 02 '14 at 09:13
  • Like all technical documents, whether you have an "ideal" CS background or not, careful study is typically required. TAOCP is readable by an undergrad with no CS education... at least the practitioner parts, which is what you want. (Yes, it has lots of deep math in it, too, but you can skip most of that.) I read it in 1969 as an undergraduate with no CS background. My guess is you'll do fine, esp. if your goal is to accomplish what you said. Given your implied background, I suggest you *really* carefully read about partial evaluation. It is a truly simple, truly powerful idea for optimizing code. – Ira Baxter Jan 02 '14 at 10:25
  • Thanks. Time for a trip to the local university library then. To the unpracticed eye partial evaluation appears like another guise of the programming-as-an-exercise-in-caching and precalculate-what-can-be-precalculated schools of thought, but deeper study of the formalism may reveal unexpected applications. (Oh, and I think you may be confusing me with the OP.) – doynax Jan 02 '14 at 10:48
  • @doynax: Partial evaluation is rather more about precalculating a specialized algorithm from an already-provided more general one. [Yes, I did confuse you with the OP. Oh, well. Hope you and he both find the advice useful :-}] – Ira Baxter Jan 02 '14 at 10:54
  • How to learn "how the gmplib" packages operate? Do they offer a PDF document? Comments in code? What? I downloaded and read a huge pile of research articles about this (divide) issue, but they are amazingly not helpful for figuring out which approach or algorithm is best for various operand sizes (or fixed vs variable size). Some articles even show graphs of NORMALIZED performance for each algorithm for different widths, but no way to compare performance of the two algorithms, even though they graph both of them! How stupid is that? Must one be impractical/clueless to perform research? – honestann Jan 03 '14 at 02:25
  • I don't understand your divide technique, and I've never heard anyone propose this for divide. This **is** how I do multiply on my large types, but multiply naturally works this way. My first thought is dividing something like 87654321 by 1234. If we assume each decimal digit is our native CPU type (64-bits) by analogy, then we must first divide 8 (or 87654321) by 1. The obvious result is 8. However, that denominator could be 1000 or 1999, which means the correct result could be 4, 5, 6, 7 or 8. Trying to figure how to fiddle around to correct all these mistakes seems impractical. No? – honestann Jan 03 '14 at 02:31
  • @honestann: Read the Knuth books; "my" algorithm is really his, and he discusses how to choose the correct quotient digit, including that "impractical" fiddling around you seem to be dismissing. – Ira Baxter Jan 03 '14 at 05:58
  • @IraBaxter: Well, confusing! I get symbol overload quickly, and the code (though short) is... ugly and confusing. However, I sat down and divided numbers I chose at random (CD53A673 / A4B2) on paper with the radix being a hex digit - which I am smart enough to handle and check. And indeed, it worked perfectly. I'm not quite sure why decimal has that horrible situation (possible results of 4, 5, 6, 7, 8 to test and fix at a single step), but this hexadecimal example was only off by 1, and only at one of several steps. So I'll forget trying to read hyper-math-ridden papers and just do it! – honestann Jan 04 '14 at 23:32
  • @IraBaxter: The code was obscure, but I noticed some of the problem was performing steps in C that are only natural and efficient in assembly language, especially with modern CISC CPUs like x86_64. The little pile of pseudocode I wrote is so vastly simpler, but I guess most people are trained like pavlovian monkeys to formulate everything in terms of math, whether necessary, appropriate or neither. Thanks for pointing me at this technique - looks like the winner... especially if I can find a way to perform useful work during the 72 cycles after executing divide and before the result appears! – honestann Jan 04 '14 at 23:43
  • I believe Knuth shows if you compute a quotient digit by dividing a fixed-size prefix (1 or 2 "digits") by the leading "digit" of the divisor, your answer is off by at most "1"; you need a corrective check and possibly an extra subtraction IIRC to handle this case. (It's been 30 years since I coded a multiprecision divide, forgive my memory lapse.) The point is that Knuth explains all this, both in an intuitive way, and then again with big-hammer mathematics to show it really works. Even now I go back to look in his books. – Ira Baxter Jan 05 '14 at 00:35
  • 72 cycles? Yeah, even single precision divides are pretty expensive. If you want to divide a lot by a known value (e.g., the "first" divisor digit), one thing you can do is compute its inverse *once* (Newton-Raphson might be good for this part), and then *multiply* by the reciprocal. Lots faster than a divide, since you amortize the reciprocal computation over many quotient digits. Knuth doesn't say this; he didn't get everything right in 1969. – Ira Baxter Jan 05 '14 at 00:36
  • @IraBaxter: Practical observations. I tried my algorithm on other numerators & denominators ("n" and "d"). The problem I noted still exists. If we divide (in hexadecimal this time) a numerator with ms4 = F and denominator with ms4 = 1, the result can be F, E, D, C, B, A, 9 or 8 (depending on the other n and d bits). Which is why we must FIRST "normalize" [n and] d so their msb == 1 are aligned. Now n and d inherently differ by a factor of **less than** 2, and we can do all the **REMAINING** operations in terms of 64-bit elements. But that first step must be in binary, to the bit. – honestann Jan 06 '14 at 04:04
  • @IraBaxter: We can always divide 1 by any denominator with this big radix algorithm too, except replace 1 with 1 << b, where b is the total number of bits in the data-types (in my case, 128, 256, 512 or 1024 bits). Thus we compute the reciprocal times 2^b which we can work with until some appropriate later time we can fix the result ("r") like this: r = r >> b. So, whether we adopt reciprocals or not (for the reason you state), we can still adopt either this "big radix" technique, or newton-raphson. Off hand "big radix" seems much faster for 512 & 1024 bits, and maybe 256 bits. – honestann Jan 06 '14 at 04:14
  • @IraBaxter: The notation in D3 is too abstract for me to comprehend, but seems to indicate that the predicted result (quotient) can be 2 too high, not just 1 too high. But it intuitively seems to me that this possibility is so astronomically tiny that contriving numerators and denominators to expose that case might be extremely difficult. So I guess that "check and correct" step needs to be a loop in which the result of subtracting the predicted (result * denominator) from the running remainder must be checked after the subtract "just in case", and repeated once every ~2^64 loops or so. – honestann Jan 06 '14 at 05:08
  • @IraBaxter: I see how the result can be larger than the denominator, but cannot see how the result can be larger than the largest numerator (which every source claims is possible). How can this be the case when the largest result is the divide-by-one case, which generates a result exactly EQUAL TO the numerator? – honestann Jan 06 '14 at 05:11

The "large-radix" approaches are more efficient for the kinds of huge data-types you mention, especially if you can execute the native 128-bit-by-64-bit divide instruction in assembly-language.

While Newton-Raphson iteration does converge quickly, each iteration requires a huge number of multiply and accumulate steps.

sophia

For the multiplication, have a look here:

http://www.math.niu.edu/~rusin/known-math/99/karatsuba (archived at http://web.archive.org/web/20141114071302/http://www.math.niu.edu/~rusin/known-math/99/karatsuba)

Basically, it allows doing a 1024 x 1024 multiplication using three (instead of four) 512 x 512-bit multiplications. Or nine 256 x 256-bit, or twenty-seven 128 x 128-bit ones. The added complexity might not beat brute force even for 1024 x 1024, but probably will for bigger products. That's the simplest of the "fast" algorithms, using n ^ (log 3 / log 2) = n^1.585 multiplications.
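A C sketch of one Karatsuba level, shown one size down for brevity: a 64x64 -> 128-bit product from three 32x32-bit multiplies instead of four. The same split applied to 64-bit limbs gives 128x128 from three 64x64, and recursively the 3/9/27 counts above; 128-bit C arithmetic absorbs the 33-bit-sum carry bookkeeping an assembly version would do explicitly (function name ours):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Karatsuba: x*y from three half-width multiplies.
   x = x1*2^32 + x0, y = y1*2^32 + y0;
   x*y = z2*2^64 + (sm - z0 - z2)*2^32 + z0,
   where sm = (x0+x1)*(y0+y1). */
static u128 mul64_karatsuba(uint64_t x, uint64_t y)
{
    uint64_t x1 = x >> 32, x0 = (uint32_t)x;
    uint64_t y1 = y >> 32, y0 = (uint32_t)y;
    uint64_t z0 = x0 * y0;                      /* 1st multiply */
    uint64_t z2 = x1 * y1;                      /* 2nd multiply */
    u128 sm = (u128)(x0 + x1) * (y0 + y1);      /* 3rd (33x33)  */
    u128 z1 = sm - z0 - z2;                     /* = x1*y0 + x0*y1 */
    return ((u128)z2 << 64) + (z1 << 32) + z0;
}
```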

I'd advise not writing everything in assembler. Implement 64 x 64 -> 128-bit multiplication with inline assembler, same with add-with-carry (I think gcc and clang have built-in operations for this nowadays); then you can, for example, multiply n bits x 256 bits (any number of words times 4 words) in parallel, avoiding all the latency of multiplication, without going mad with assembler.

gnasher729
  • Thanks for your 3 new messages. I already wrote implementations for some of my routines, then got sidetracked (before I got to the f0128, f0256, f0512, f1024, f2048, f4096 floating-point versions). But I did finish the integer multiplies and divides. The multiply loops are table driven; each step is 64-bit * 64-bit ===>> 128-bit result (then add that in). The divide is done in 64-bit chunks too, similar to the old-fashioned school method. If you want to compare the speed of these routines to yours, I can send my code (all 64-bit att/gas/linux assembly). – honestann Mar 19 '14 at 01:51

For a large number of bits, I learned the quickest algorithm goes like this: instead of dividing x / y, you calculate 1 / y and multiply by x. To calculate 1 / y:

1 / y is the solution t of f(t) = (1 / (t y)) - 1 = 0.
Newton iteration: t' = t - f(t) / f'(t)
  = t - ((1 / (t y)) - 1) / (-1 / (t^2 y))
  = t + (t - t^2 y)
  = 2t - t^2 y

The Newton iteration converges quadratically. Now the trick: if you want 1024-bit precision, you start with 32 bits; one iteration step gives 64 bits, the next gives 128 bits, then 256, then 512, then 1024. So you do many iterations, but only the last one uses full precision. So all in all, you do one 512 x 512 -> 1024 product (t^2), one 1024 x 1024 -> 1024 product (t^2 * y, after which 2t - t^2 y gives 1 / y), and another 1024 x 1024 product (x * (1 / y)).
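The quadratic convergence this answer relies on can be demonstrated with a small C sketch in double precision (assumptions ours: y > 0, and the standard 2.9142 - 2m linear seed, which is only ~4 bits accurate; the real library would run the same schedule in widening fixed-point):

```c
#include <math.h>

/* Reciprocal of y > 0 via the Newton step t' = 2t - t^2*y.
   Each step roughly doubles the number of correct bits. */
static double recip_newton(double y, int steps)
{
    int e;
    double m = frexp(y, &e);                 /* y = m * 2^e, 0.5 <= m < 1 */
    double t = ldexp(2.9142 - 2.0 * m, -e);  /* crude linear seed, ~4 bits */
    while (steps-- > 0)
        t = 2.0 * t - t * t * y;             /* the iteration derived above */
    return t;
}
```

Starting from ~4 correct bits, six steps already exhaust double precision, matching the 32 -> 64 -> 128 -> ... -> 1024 schedule described above.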

Of course you have to figure out very precisely what the error is after each iteration; you'll probably have to start with, say, 40 bits and lose a bit of precision in each step so you have enough at the end.

I have no idea at which point this would run faster than a straightforward brute-force division as you learned it at school. And y may have fewer than the full number of bits.

gnasher729

The alternative is brute force. You could take the highest 128 bits of x, divide by the highest 64 bits of y, and get the highest 64 bits r of the quotient, then subtract r * y from x. And repeat as needed, carefully checking how big the errors are.

Divisions are slooow. So you calculate 2^127 / (highest 64 bits of y) once. Then to estimate the next 64 bits, multiply the highest 64 bits of x with this number and shift everything into the right place. Multiplication is much faster than division.
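A C sketch of that precomputed-reciprocal trick (function names ours; we assume yhi is normalized with its top bit set and xhi < yhi, as for the running remainder in a long division): `inv` costs one divide per whole division, after which each digit estimate is a multiply and a shift. The estimate is never high and at most slightly low, so the usual correction step still applies:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute floor(2^127 / yhi) once per division. */
static u128 recip127(uint64_t yhi)
{
    return ((u128)1 << 127) / yhi;
}

/* Estimate one 64-bit quotient digit: xhi * inv, shifted into place,
   approximates floor((xhi * 2^64) / yhi) without a divide. */
static uint64_t est_digit(uint64_t xhi, u128 inv)
{
    return (uint64_t)(((u128)xhi * inv) >> 63);
}
```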

Next you'll find that all these operations have long latencies. For example, 5 cycles to get a result, but you could start a multiplication every cycle. So: estimate 64 bits of the result. Start subtracting r * y at the high end of x, so you can estimate the next 64 bits as quickly as possible. Then you subtract two or more products from x simultaneously, to avoid the penalty from latency. Implementing this is tough. Some things may not be worth it even for 1024 bits (which is just sixteen 64-bit integers).

gnasher729