
I have a very big unsigned binary big integer on the scale of 6*10^120. Assume the big integer is stored in a struct of many QWORD (8-byte) unsigned integers or in several YMM registers.

I want to display it in decimal (not binary) scientific notation like 6E120. The mantissa is always 1 digit and must be the leading digit of the full decimal representation; truncate it to 1 significant digit, not rounding to nearest. The exponent is always 3 digits. The format is aExyz like 8E095.

What is the most time-efficient (fastest) algorithm to find the order of magnitude (power of 10) and leading decimal digit? I am asking for the algorithm, not for the program, I will write it myself.

It will be done in MASM64 assembly language. If there are instructions that can help like bit manipulations or FPU/SSE/AVX512 tricks, please do suggest them.

This is not a high-level program so any responses that include third-party libraries or high-level language constructs are not helpful. I know a certain algorithm involves many divisions. Those are expensive in ASM so I'm looking for an alternative. I know how to convert from binary to decimal and then to scientific notation. I'm trying to avoid that middle step.

Peter Cordes
Danny Cohen
    If it's not a round number, like not *exactly* 6e120, do you want to round to just one significant digit? That could probably be much faster, based on finding the leading 1 bit, a lookup table, and maybe some comparison or repeated subtraction. (I'm thinking of extending the algorithm for quickly finding the number of decimal digits needed for a binary integer, from 0 significant figures to 1. [How to retrieve the first decimal digit of number efficiently](https://stackoverflow.com/a/17394218)) – Peter Cordes Mar 03 '23 at 08:19
  • 2
    [What's the fastest way to obtain the highest decimal digit of an integer?](https://stackoverflow.com/q/38633037) suggests clz then one division, but that might not work well for bigints. Still, it's much better than peeling off digits one at a time starting from the lowest, like normal int->string when you need all the decimal digits. – Peter Cordes Mar 03 '23 at 08:20
  • 1
    [Count number of digits - which method is most efficient?](https://stackoverflow.com/a/9721570) is another version of the exponent-finding. Also [Get number of digits in a base10 integer](https://stackoverflow.com/q/75542935) avoids a LUT, but linear searches for a power of 10 larger than the input. Binary search could work, or to avoid division / recomputing, going by factors of 10^6 or 10^12 or something. (Multiplying a many-limb number by a single-limb number is not too bad, and perhaps a rough approximation based on `clz` can get to the right ballpark) – Peter Cordes Mar 03 '23 at 08:30
  • 2
    Assuming you are rounding to one significant figure, how much rounding error is acceptable? Like, if you round by looking only at the qword containing the highest set bit, not looking at lower qwords, you might round towards a value that's not the nearest, so you might get a worst case rounding error somewhat greater than 0.5 in the last (and only) digit. That could maybe save having to compute the full binary values of big powers of 10, if it's acceptable. Perhaps making a lookup table small enough to be usable? A LUT could be good if you use this very frequently, otherwise just compute. – Peter Cordes Mar 03 '23 at 08:38
  • Thank you for all your comments. To make things clear, the big binary number is an integer so no rounding is involved. All digits can be different so it is not just a bunch of zeros. I know how to convert from binary to decimal and then from decimal to scientific notation. I'm trying to avoid that middle step and convert directly from binary to scientific notation. If by rounding you meant the accuracy of the mantissa and exponent, so yes, they have to be exact. The mantissa is always 1 character and the exponent is 3 characters. If there is no way, I'll accept it as an answer. – Danny Cohen Mar 03 '23 at 08:59
  • Eventually, I will implement the algorithm in MASM64. So if there are any advanced instructions that can help like bit manipulations or AVX tricks, feel free to use them. – Danny Cohen Mar 03 '23 at 09:17
  • 2
    So you do only ever want to print one digit for the significand. Not like 5.99e120. Printing 6e120 when the actual value is 6e120 - 1 *is* rounding, and 6e120 is not an "exact" representation of the binary integer. I think you mean it has to be "correctly rounded", so you always print the nearest representable 1digit E 3digit value. (I'm assuming you want round-to-nearest-even, but truncation toward 0 would also be possible and easier. So 6e120 - 1 would print as 5e120.) Anyway, I was assuming you were rounding to a 1 digit mantissa, that's what those links are about. – Peter Cordes Mar 03 '23 at 09:18
  • Yes Peter, truncation is allowed. Rounding is not allowed. The output format will always be aExyz like 8E096. – Danny Cohen Mar 03 '23 at 09:20
  • 2
    Truncation is a rounding mode, rounding toward 0. So if the exact value was 9999999999999999999999999999999999999999999999999999999999999999, you'd want to print 9E064 instead of 1E065, truncating toward zero? That makes it much easier, you probably can just look at the leading qword most of the time, only needing to look lower as a tie-break if it matches that of some 1-digit multiple of a power of 10. If you wanted to avoid rounding, you'd have to print the entire value, so the value you print out could be read back in to get the same binary bigint. – Peter Cordes Mar 03 '23 at 09:23
  • Yes Peter, 9E064 is the needed format. 1E065 is a bigger number. – Danny Cohen Mar 03 '23 at 09:25
  • 1
    Ok then yeah, you just need the highest decimal digit of the exact representation, and a count of the decimal digits. Truncation should be more efficient than any other rounding mode. Either with division by a power of 10, or comparisons against 1..9 times a power of ten of the right magnitude, or repeated subtraction. You should update your question with those details. – Peter Cordes Mar 03 '23 at 09:31
  • 2
    re: efficiently finding the MSB of the binary number: [Efficiently find least significant set bit in a large array?](https://stackoverflow.com/q/67605508) . Easily adaptable to search the other direction, using `lzcnt` instead of `tzcnt` on a `vpmovmskb` or `vmovmskps` result. Or on an AVX-512 `k` register after `kmovq rax, k1`. (Or perhaps `bsr` to directly get an element index, although `bsr` is slower on AMD CPUs than `lzcnt`, but same speed on Intel. `lzcnt` on Skylake fixes the false dependency it had since SnB, but `bsr` always depends on the output register.) – Peter Cordes Mar 03 '23 at 09:53
  • 2
    @DannyCohen: What's the biggest value/bitcount you have to deal with? Do you ever have more than 2 YMM registers (1E154)? 3 YMM registers (1E231)? – Mooing Duck Mar 03 '23 at 23:17
  • I require up to 128 decimal digits. so 9E127 is the largest number. I haven't implemented the bigint struct yet. Using 7 QWORD or several YMM registers is currently an idea. – Danny Cohen Mar 04 '23 at 04:27

1 Answer


Assuming the maximum possible value is less than 1E154, and thus all values fit in 512 bits, then I suspect the answer might be:

  1. Precalculate all possible powers_of_10 in a static const array. (#ops ~0)
    • If all numbers fit in 2 YMM registers, that means the max supported value is 1E154, which means the lookup table of powers of 10 would take ~9856 bytes.
  2. In your number, calculate the number of leading binary zeroes (using `lzcnt`/clz). (#ops < ~10)
  3. Use (max_bits - number_of_leading_zeroes) / 3.32192809489 (that constant is log2(10); in asm you'd multiply by the reciprocal rather than divide) to get a good estimate of the final number of decimal digits. This also gives a good estimate of a close power of 10. (#ops ~2)
  4. Iterate the powers_of_10 from that estimate until you find the largest power-of-10 less than your value. (#ops ~8)
    • If you're willing to sacrifice accuracy in the exponent, then this step can be skipped.
  5. Add that power of 10 to itself in a loop until greater than your input. (#ops < ~100) (10 additions of 10 uint64s)
    • If you're willing to sacrifice minor accuracy in the mantissa, then double(input)/double(power_of_ten) will do it in a single division.
    • There's lots of shortcuts that you can take in the bigint->double conversion.
  6. Emit loop_count-1 E power_of_ten_index. (#ops ~4)
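The steps above can be sketched in C++ (the asker would hand-translate to MASM64). The `U512` type, helper names, and the runtime-built table are all illustrative assumptions, not code from the benchmark; it also assumes a GCC/Clang-style compiler for `unsigned __int128` and `__builtin_clzll`, and a nonzero input below 1E154 so intermediates don't wrap:

```cpp
#include <cstdint>

// Hypothetical 512-bit unsigned bigint: 8 little-endian 64-bit limbs.
struct U512 { uint64_t limb[8]; };

static int cmp(const U512& a, const U512& b) {          // -1 / 0 / +1
    for (int i = 7; i >= 0; --i)
        if (a.limb[i] != b.limb[i]) return a.limb[i] < b.limb[i] ? -1 : 1;
    return 0;
}

static void add(U512& a, const U512& b) {               // a += b, with carry
    unsigned __int128 carry = 0;
    for (int i = 0; i < 8; ++i) {
        carry += (unsigned __int128)a.limb[i] + b.limb[i];
        a.limb[i] = (uint64_t)carry;
        carry >>= 64;
    }
}

static U512 mul10(const U512& a) {                      // a * 10, 1-limb multiply
    U512 r{};
    uint64_t carry = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned __int128 p = (unsigned __int128)a.limb[i] * 10 + carry;
        r.limb[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    return r;
}

// Step 1: table of powers of 10 (built at runtime here; static const in asm).
static const U512* pow10_table() {
    static U512 t[155];
    static bool done = false;
    if (!done) {
        t[0] = U512{};
        t[0].limb[0] = 1;
        for (int i = 1; i < 155; ++i) t[i] = mul10(t[i - 1]);
        done = true;
    }
    return t;
}

// Steps 2-6: returns the exponent, stores the truncated leading digit.
// v must be nonzero and < 1e154.
int to_sci(const U512& v, int* digit) {
    const U512* p10 = pow10_table();
    int msb = -1;                                       // step 2: highest set bit
    for (int i = 7; i >= 0 && msb < 0; --i)
        if (v.limb[i]) msb = i * 64 + 63 - __builtin_clzll(v.limb[i]);
    int e = (int)(msb * 0.30102999566398);              // step 3: msb/log2(10), never overshoots
    while (e + 1 < 155 && cmp(p10[e + 1], v) <= 0) ++e; // step 4: correct estimate upward
    U512 acc = p10[e];                                  // step 5: repeated addition
    int d = 1;
    for (;;) {
        U512 next = acc;
        add(next, p10[e]);
        if (cmp(next, v) > 0) break;                    // next multiple would exceed v
        acc = next;
        ++d;
    }
    *digit = d;                                         // step 6: print as dE%03d
    return e;
}
```

Since `2^msb <= v < 2^(msb+1)`, the step-3 estimate `floor(msb * log10(2))` is never above the true exponent and at most 1 below it, so step 4's loop runs at most one iteration.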

If you're willing to sacrifice accuracy in both exponent and mantissa, that leaves ~16 operations that completely ignore the low bits.
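For the float-based mantissa guess (step 5's `double(input)/double(power_of_ten)` variant), a truncating bigint-to-double conversion that "completely ignores the low bits" can be sketched as follows. This is a guess at one plausible shortcut, not the benchmark's code: take the binary exponent from the MSB position and the next 52 bits as the mantissa, ignoring everything below the top two limbs (GCC/Clang `__int128` assumed):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical truncating bigint -> double: binary exponent from the MSB
// position, mantissa from the next 52 bits, lower limbs ignored.
// 'limbs' is little-endian; the value must be nonzero.
double bigint_to_double_trunc(const uint64_t* limbs, int nlimbs) {
    int top = nlimbs - 1;
    while (limbs[top] == 0) --top;                    // highest nonzero limb
    int msb = top * 64 + 63 - __builtin_clzll(limbs[top]);
    unsigned __int128 mant = limbs[top];
    int have = 64 - __builtin_clzll(limbs[top]);      // significant bits so far
    if (top > 0) {                                    // pull in the next limb
        mant = (mant << 64) | limbs[top - 1];
        have += 64;
    }
    if (have > 53) mant >>= (have - 53);              // truncate toward zero
    else           mant <<= (53 - have);
    // mant now holds the top 53 bits of the value, i.e. floor(v / 2^(msb-52))
    return std::ldexp((double)(uint64_t)mant, msb - 52);
}
```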

Performance

Without writing out a final implementation, performance is hard to guess at, and with a large LUT, cache pressure (and therefore the rest of the program) becomes a factor, but here's preliminary data: https://quick-bench.com/q/53k-xSQz7y4iCO7ny66Dz2w62dQ (I ran it several times to try to eliminate outliers.)

In my tests, the fastest combination seems to (unsurprisingly) be:

  • Do not iterate the powers_of_ten to confirm the exponent.
  • Do not exactly calculate the mantissa.
  • Use floats to guess the mantissa (preferring speed to accuracy)

From this baseline, we can see:

  • Iterating the powers_of_ten seems to have no noticeable impact on time on average, as long as you also use floats to guess the mantissa. If you do not use floats to guess the mantissa, then the mantissa calculation will take significantly longer. This implies it is not adding significant accuracy on average, and can be skipped to minimize code size.
  • Using floats to guess the mantissa seems to make the algorithm ~5% faster on average, with no impact on accuracy.
  • Iterating to find the exact mantissa slows the algorithm about 16%, but also implies that it's adding accuracy, so I assume you want to keep that.
  • I had two variants of my bigint->double conversion code when estimating the mantissa. One version only ensured the double had at least 1 significant bit minimum, and the other version ensured that the double had >4 significant bits. The additional code to ensure more significant bits had no noticeable impact on time on average, so the additional code/accuracy at that step did not result in noticeably more performance in finding the exact mantissa, and I assume you want to skip that, to minimize code size.

It is also interesting to note that all of these are ~4x faster than converting the bigint to a double and then using printf("%.0E",value);

(I do not vouch for the accuracy of the results of any of this code)

Mooing Duck
  • If the number is relatively small, like 1E003, then I need maximum accuracy. Beyond that, accuracy can be sacrificed for speed and/or simplicity of the algorithm. – Danny Cohen Mar 03 '23 at 19:22
  • 1
    10 ops might be optimistic for finding the MSB in a bigint, especially if there are more than 256 bits of leading zeros (one YMM register) so you have to loop more than 1 iteration in [Efficiently find least significant set bit in a large array?](https://stackoverflow.com/q/67605508) . But yeah it's relatively cheap. For the estimate, you would of course multiply by `(1/3.32192809489)` instead of actually dividing; it's just an estimate anyway and the divisor wasn't exact. (Remember, the OP's planning to hand-write in asm, so there's no compiler to do this for them.) – Peter Cordes Mar 03 '23 at 23:02
  • 1
    He said numbers are on the scale of 6*10^120. 7 uint64s support up to 7.268387e+134, so he probably has 8 uint64s to be safe. You probably only need to `ctzll` on 2-3 of them on average, so 4-6 ops on average; 16 as a rough upper bound. Linear scan is sufficient for this scale of number, unlike your link. – Mooing Duck Mar 03 '23 at 23:06
  • 1
    I wonder if the addition / compare steps (or repeated subtraction) could be done on just the leading qword without loss of precision; probably not for every case, but maybe just the leading two qwords? Or maybe some quick check that can determine whether we can skip the low qwords and just do repeated subtraction of the highest qword. (Or binary search for the right multiple of the high qword, or at least compare against `hq*5` or something (one LEA) to cut the search area in half. If it's higher, two doublings instead of 3 additions can produce bigint * 4. – Peter Cordes Mar 03 '23 at 23:08
  • 1
    It just occurred to me that if you always start with `double(input)/double(power_of_ten)`, then you only need 3 additions max... Ah but then you need a multiplication, which basically negates the savings :( – Mooing Duck Mar 03 '23 at 23:09
  • 1
    You can shortcut the addition/compare steps to quickly *estimate* the last digit, but you have to calculate it in full regardless to get it accurate, which is roughly the same number of ops as the repeated addition in the first place – Mooing Duck Mar 03 '23 at 23:12
  • 2
    Ah, good point about the value-range for 6E120, but that's contradicted by the OP's "several YMM registers". Maybe they just haven't done the math, since 8x `uint64_t` fits in two YMM registers. 2x `vpcmpeqq` / `vpmovmskpd` to get two 4-bit bitmaps, and concatenate them (SHL/OR) to get an 8-bit compare map might be a good bet. Or yeah, linear scan since it requires so much processing to get and deal with an element index. OTOH, that trades branch prediction against throughput and latency. – Peter Cordes Mar 03 '23 at 23:15
  • 1
    I am unfamiliar with YMM and am reading up on it now. Seems interesting, but I'd need to know a lot more before I could optimize my answer using it. Someone should make an answer using 2-3 YMM registers. – Mooing Duck Mar 03 '23 at 23:16
  • 1
    A YMM register is 256 bits wide, the asm equivalent of C/C++ intrinsic `__m256i` or C# `Vector256`. The existence of packed compare and `vpmovmskps/pd` instructions to get a compare bitmap is pretty useful. My answer on [Efficiently find least significant set bit in a large array?](https://stackoverflow.com/q/67605508) has C and asm, and like I said is fairly easily adaptable to searching for the highest set bit. To extract the qword or dword containing it, that might be most efficient with store to a scratch array and reload, vs a shuffle like `vpermq ymm0, ymm1, ymm2` / `vmovq eax, xmm0`. – Peter Cordes Mar 03 '23 at 23:22
  • 2
    Estimating the leading digit to within one could still be useful, then you just need one bigint multiply (of n-limb by 1-limb, which is pretty cheap with mulx and adox/adcx) and a bigint subtract. Which could potentially be folded together, subtracting partial products from the original instead of producing a 2nd bigint to compare. But you're proposing `double(input)` which would require bigint -> double conversion; hrm, do you just use the manually calculated binary exponent (from clz) and take the top 52 bits below the leading 1 as the mantissa, to convert with truncation? – Peter Cordes Mar 03 '23 at 23:27
  • 1
    @PeterCordes I hadn't considered it, but that seems like a plausible shortcut. – Mooing Duck Mar 03 '23 at 23:31
  • 1
    Binary floating point can't exactly represent large powers of 2, unfortunately. Using https://www.h-schmidt.net/FloatConverter/IEEE754.html we can see that `100000000000` gets rounded to a float32 with value `99999997952`. You'd run into the same problem for `double` at some larger powers of 10, once you have enough factors of 5 that the mantissa can't exactly represent them. That might not be a showstopper if we can construct our table so the error is always in one direction, so we only ever need to check two possible leading-digit values. – Peter Cordes Mar 04 '23 at 02:03
  • 1
    Test-cases just above and below the cutoffs between 99... and 100... for each different magnitude shouldn't take too long to brute-force check, so we might get lucky and find that something works for all the powers of 10 we care about, even if it's hard to prove its safe in general, or if it would definitely fail for some much larger power of 10. – Peter Cordes Mar 04 '23 at 02:08
  • I agree that building the initial in-memory const table of powers of 10 will be problematic, as I will quickly reach big integers that need bigint structs of their own. It kind of defeats the purpose. I will not use floating point math at all. – Danny Cohen Mar 04 '23 at 04:38
  • 1
    *binary FP can't exactly represent large powers of **10*** is of course what I meant to write. Powers of 2, and multiples of powers of 2, are no problem up to the exponent limit. – Peter Cordes Mar 05 '23 at 23:49
  • 2
    @PeterCordes: Correct, the floating point should be rigged to always underestimate, so you only have to check two possibilities. – Mooing Duck Mar 06 '23 at 05:29
  • 1
    @DannyCohen If all numbers fit in 2 YMM registers, that means the max supported value is 1E154, which means the lookup table of powers of 10 would take ~9856 bytes. It's not a massive amount of memory. – Mooing Duck Mar 06 '23 at 05:33
  • 1
    That's a large footprint for L1d cache; I'd only consider a LUT that big if this operation was needed very frequently, or in large enough batches to pay for those cache misses, and the later cache misses in whatever was displaced to make room. (@DannyCohen). A table of every 8th value, plus some computation (multiply by 10e0..7) could cut the table size by a factor of 8. Or if the table can be `double`s instead of bigints, 1240 bytes for 155 entries (exponent 0 to 154), or less if we interpolate between gaps. (log2(1e154) ~= 511.58, so it's in the exponent range) – Peter Cordes Mar 06 '23 at 05:48
  • 1
    @PeterCordes: If you only store every 8th, then you have to do repeated addition to calculate the ones between. You need to ensure you have the right one, which means you have to calculate 2-3 adjacent powers-of-ten, so you can't take many shortcuts. If each addition is 4 operations, then calculating the ones needed is ~32 operations. An L2 cache read is only ~12 cycles, and even L3 is only ~36. Also you'll have more code, which results in more cache misses. Ergo, as long as the table entries fit into *L3* cache, the LUT is probably fastest. – Mooing Duck Mar 06 '23 at 06:06
  • 1
    `adc` is about 4/clock throughput. Out-of-order exec can overlap across executions for multiple short bigint adds. If you keep the 8x u64 for one bigint in registers, and some of the 8 qwords for another, then you won't bottleneck on 2/clock loads (or 3/clock in Alder Lake and Zen 3), or any extra uops to store. x86-64 has 15 GP-integer registers not counting the stack pointer. A cache miss costs latency, not throughput unless it stalls so long OoO exec can't hide it, so they're not directly comparable. – Peter Cordes Mar 06 '23 at 06:28
  • 2
    It probably depends how often you need this operation vs. what else your program does. If the rest of your program often loops over largish arrays and blows away cached data between calls to this, then evicting other stuff from cache matters less. But then we'd expect cache misses from our LUT. – Peter Cordes Mar 06 '23 at 06:30
  • @PeterCordes: it just occurred to me that by storing every 8th you only have to do _one_ multiply, by a power of 10 between 1 and 100000000, which is an int, which means the full multiplies are only as expensive as an addition, so calculating what's needed is only 8-12 ops, relative to 32. (I have also now realized that each addition is 16 ops, rather than 4, so all my numbers are off by 4x, but that simply makes the LUT even better looking) – Mooing Duck Mar 07 '23 at 15:34
  • 1
    @MooingDuck: Yeah, that's what I meant by "multiply by 10e0..7" when I first suggested that. Hadn't realized you were thinking bigint 1-limb x 8-limb would be more expensive than the actual cost of 8 `mul` or `mulx` plus some `adc` or `adox`/`adcx`; I thought you were talking about `double` multiply which might round *up* and spoil our attempt to only consider two values. A `double` LUT might require always rounding down instead of just taking the nearest representable value to 10^n – Peter Cordes Mar 07 '23 at 20:16
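The every-8th-entry LUT compression from these last comments can be sketched as below: the missing entries are reconstructed with one n-limb × 1-limb multiply (in asm, a chain of `mulx` plus `adc`/`adcx`). The function name and calling convention are illustrative, the product is assumed to fit in n limbs, and GCC/Clang `__int128` is assumed:

```cpp
#include <cstdint>

// n-limb × 1-limb bigint multiply with carry propagation. With a table
// holding only 10^0, 10^8, 10^16, ..., any 10^e is table[e / 8] times the
// small constant 10^(e % 8), which fits in a single limb.
// Limbs are little-endian; the result must fit in 'n' limbs.
void mul_small(uint64_t* r, const uint64_t* a, int n, uint64_t m) {
    uint64_t carry = 0;
    for (int i = 0; i < n; ++i) {
        unsigned __int128 p = (unsigned __int128)a[i] * m + carry;
        r[i] = (uint64_t)p;                 // low 64 bits of the partial product
        carry = (uint64_t)(p >> 64);        // high 64 bits ripple into the next limb
    }
}
```

For example, 10^21 would be `table[2]` (holding 10^16) multiplied by the one-limb constant 100000 (10^5).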