14

My program frequently requires the following calculation to be performed:

Given:

  • N is a 32-bit integer
  • D is a 32-bit integer
  • abs(N) <= abs(D)
  • D != 0
  • X is a 32-bit integer of any value

Find:

  • X * N / D as a rounded integer that is X scaled to N/D (i.e. 10 * 2 / 3 = 7)

Obviously I could just use r=x*n/d directly but I will often get overflow from the x*n. If I instead do r=x*(n/d) then I only get 0 or x due to integer division dropping the fractional component. And then there's r=x*(float(n)/d) but I can't use floats in this case.

Accuracy would be great but isn't as critical as speed and being a deterministic function (always returning the same value given the same inputs).

N and D are currently signed but I could work around them being always unsigned if it helps.

A generic function that works with any value of X (and N and D, as long as N <= D) is ideal since this operation is used in various different ways but I also have a specific case where the value of X is a known constant power of 2 (2048, to be precise), and just getting that specific call sped up would be a big help.

Currently I am accomplishing this using 64-bit multiply and divide to avoid overflow (essentially int multByProperFraction(int x, int n, int d) { return (__int64)x * n / d; } but with some asserts and extra bit fiddling for rounding instead of truncating).

Unfortunately, my profiler is reporting the 64-bit divide function as taking up way too much CPU (this is a 32-bit application). I've tried to reduce how often I need to do this calculation but am running out of ways around it, so I'm trying to figure out a faster method, if it is even possible. In the specific case where X is a constant 2048, I use a bit shift instead of multiply but that doesn't help much.

Taron
  • 191
  • 9
  • 1
    For the case that X is a power of 2: [Divide 64-bit integers as though the dividend is shifted left 64 bits, without having 128-bit types](https://stackoverflow.com/q/54889529/995714). For the general case: [Most accurate way to do a combined multiply-and-divide operation in 64-bit?](https://stackoverflow.com/q/8733178/995714), [(a * b) / c MulDiv and dealing with overflow from intermediate multiplication](https://stackoverflow.com/q/54232987/995714)... – phuclv Aug 01 '19 at 01:42
  • X = 0, that will give you 0, which is in 0..X range. – lenik Aug 01 '19 at 01:47
  • @phuclv Thanks, that does give me some good starting points to look. I should have mentioned I'm not using MSVC so need something that doesn't depend on it, and again the classic method of using a higher-bit variable is what I'm already doing but it is slow. – Taron Aug 01 '19 at 02:04
  • What CPU do you use? Arm/x86? – geza Aug 01 '19 at 02:13
  • 1
    Have you tried `return (long long)x * n / d`? – chux - Reinstate Monica Aug 01 '19 at 02:23
  • Are `n,d` relatively stable? Do they change often? – chux - Reinstate Monica Aug 01 '19 at 02:28
  • You mentioned `X * N / D as an integer`. Just to clarify, is the result guaranteed to be an integer, or you want the floored (or rounded ) integer? – ph3rin Aug 01 '19 at 02:30
  • Seriously, overflow should rarely be a problem and if it is for your application, then you should make a 64 bits application so that you will get better speed for 64 bit operation. If you are under Windows, you might want to use `MulDiv` which does exactly that (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-muldiv). You might check if your compiler or CPU has something similar. In any case, you should avoid writing code like `multByProperFraction`. – Phil1970 Aug 01 '19 at 02:55
  • The first step would be to delete the function. The second step would be to analyse every place that you get an "multByProperFraction doesn't exist" error and replace the code with new code that's optimized for that specific case (for the accuracy and ranges that are actually used at that place), including merging parts of the calculation throughout the surrounding code (e.g. maybe in one place its a loop looking for the largest result and you can do one division after the loop instead of many divisions). – Brendan Aug 01 '19 at 02:59
  • @Kaenbyou Rin I just want the result to be an integer, ideally rounded. – Taron Aug 01 '19 at 04:35
  • 1
    @Phil1970 Overflow is a problem because I am using fixed-point math in these cases (again, lack of floats as I mentioned) so the integers are often quite large. – Taron Aug 01 '19 at 04:35
  • @chux - Yes, that's still a 64-bit operation which is what I'm trying to avoid. The only stable case I have is the one I mentioned where X is 2048 (sorry I earlier said I had one where D was stable but I got X and D swapped in my brain). – Taron Aug 01 '19 at 15:37
  • 1
    @Taron Avoidance of 64-bit operations is not the goal - speed is the goal. A 64-bit op shift is fast. It is the 64-bit _divide_ we need to avoid. – chux - Reinstate Monica Aug 01 '19 at 16:01
  • @chux quite right, it is the 64-bit divide that is the problem, good point. – Taron Aug 01 '19 at 17:19
  • Any chance you can go totally old school and use a large reciprocal lookup table for your `d` value? This is what we did way back when we used integer math in texture mapping code before GPUs were a thing. – Michael Dorgan Aug 01 '19 at 17:41
  • I'm also curious what is too slow for your 64-bit divider? Is it takes 100s of cycles or is it just because you are in a very tight loop and the divide overwhelms the rest of the math? – Michael Dorgan Aug 01 '19 at 17:44
  • @MichaelDorgan LOL, I did not expect to run into you on Stack Overflow old friend. I was literally thinking "what would Dorgan do?" when I saw __udivmoddi4 pop up as the function the CPU was spending the most time on in NxCPUProfiler on my 32-bit application. It is being used in my physics code, which is all purely fixed-point to guarantee the same results in replays even across platforms. – Taron Aug 01 '19 at 18:06
  • @MichaelDorgan To answer your questions, no I can't use a table. I don't know how long the divide actually takes I just know it is the top "Hot Function by Self" in the profiler (you'd know better than me what __udivmoddi4 does in 32-bit NX apps!). The primary place it is used is to look up an entry in a table as part of a fixed-point CORDIC-based atan2 function that compares X to Y and scales X/Y (or Y/X) to 0-2048 and uses that as an index in a look-up table for initial value of the arctan (then refines it from there using CORDIC), and I drop fps when a lot of angle-based physics occur. – Taron Aug 01 '19 at 18:25
  • Taron, use of a profiler: good. Identifying `multByProperFraction()` as the "Hot function": good. Focusing solely on that function: not so good. 1) try replacing with `int multByProperFraction() { return 42; }`, see if _performance_ is OK. 2) I suspect the best answer is instead to optimize the higher level code in conjunction with this function. – chux - Reinstate Monica Aug 02 '19 at 15:21
  • @chux Good advice but already done. I profiled and optimized higher code, including eliminating about half the calls to that func, and had run out of ideas for how to optimize elsewhere before posting this. I just bothered me that ````__udivmoddi4```` was still the top function when lots of collisions were happening, even if it is now only 3.2% CPU and no longer drops fps. I was hoping there was a specialized math or bit twiddling that could eliminate the need for a divide if N <= D. I agree the best path is to stop using this function generically, but I'll still try to speed it up if I can. – Taron Aug 02 '19 at 18:06
  • OMG - Taron. Lol. If this is a Nintendo question, send a response through the proper channels and it will get to my desk pretty quickly. Fixing stuff like this is kinda my job :) – Michael Dorgan Aug 02 '19 at 18:21
  • If you ask something on the forums, mention me specifically and that will get it here even quicker. – Michael Dorgan Aug 02 '19 at 18:30

3 Answers3

3

Tolerate imprecision and use the 16 MSBits of n,d,x

Algorithm
while (|n| > 0xffff) n/2, sh++
while (|x| > 0xffff) x/2, sh++
while (|d| > 0xffff) d/2, sh--
r = n*x/d  // A 16x16 to 32 multiply followed by a 32/16-bit divide.
shift r by sh.

When 64 bit divide is expensive, the pre/post processing here may be worth to do a 32-bit divide - which will certainly be the big chunk of CPU.

If the compiler cannot be coaxed into doing a 32-bit/16-bit divide, then skip the while (|d| > 0xffff) d/2, sh-- step and do a 32/32 divide.

Use unsigned math as possible.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • 1
    I like where you are going with this (I did say precision wasn't as important as speed in my question after all). I was thinking though, since I know n <= d, I know r will be <= x, so there's no reason for d to have more bits than x (or only 1 more bit maybe?) as the extra precision would be lost anyway. Would it thus make sense to check if x is already 16-bit or less in value, and if so, reduce n and d to also be 16-bit, but if not do the full 64-bit divide? That would eliminate loss of precision but still speed up any cases where X happens to be <= 0xFFFF, right? – Taron Aug 01 '19 at 17:04
  • @Taron If you really want good complete code, make a test harness (test code) that approximates performance and provides a rating. Suggest that you start with `(uint64_t)x*n/d`, and your other ideas. – chux - Reinstate Monica Aug 02 '19 at 15:04
  • @Taron Re: "Would it thus make sense to check if x is already 16-bit or less in value": Each check cost time. Are you trying to optimize _average_ time (yes do the check) or _worst case_ time (then do not check)? ` – chux - Reinstate Monica Aug 02 '19 at 15:12
  • @Taron `while (|n| > 0xffff) n/2, sh++` can likely be coded, not as while loop, but as more efficient code. In essence, `while (|n| > 0xffff) n/2` is a optimizing sub-problem - IMO, this could be done with just 3 to 4 `if`s. – chux - Reinstate Monica Aug 02 '19 at 15:13
  • well, when I asked, I was thinking about the specific case where I know X is a constant 2048, and figured in that case at least I could immediately reduce n and d even if I leave the general-use function as a 64-bit divide. There may be other cases where outside code could know the ranges of X, N, and D and whether 64-bit is actually necessary beforehand without needing a generic function that checks every time. Either way I was planning to profile and test various solutions (including some interesting ones I saw in other questions) to find which, if any, actually increased performance. – Taron Aug 02 '19 at 17:00
  • See my update in the question for my attempt to optimize this algorithm, which turned out to be the best in terms of a good balance of precision and speed. I'm honestly not sure why calculating msb and then shifting by that was any faster than using while loops, especially since it was still faster even when the msb function also used a while loop, but it made a difference of like 3-4 times the speed in my benchmarks! Maybe better branch prediction? Anyway, if you have any suggestions for further sub-optimizations from what I came up with please share! – Taron Aug 06 '19 at 02:20
  • @Taron Your update could be an answer. https://stackoverflow.com/help/self-answer – chux - Reinstate Monica Aug 06 '19 at 02:23
  • this is the first question I've ever posted so new to the etiquette, and I wasn't sure if it was really fair to post it as an answer given that the solution is your algorithm just with some optimization. – Taron Aug 06 '19 at 02:31
2

The basic correct approach to this is simply (uint64_t)x*n/d. That's optimal assuming d is variable and unpredictable. But if d is constant or changes infrequently, you can pre-generate constants such that exact division by d can be performed as a multiplication followed by a bitshift. A good description of the algorithm, which is roughly what GCC uses internally to transform division by a constant into multiplication, is here:

http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

I'm not sure how easy it is to make it work for a "64/32" division (i.e. dividing the result of (uint64_t)x*n), but you should be able to just break it up into high and low parts if nothing else.

Note that these algorithms are also available as libdivide.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • I did see libdivide earlier in my searching for answers for this and thought about using it, but there's only one specific situation where d is constant and predictable, so I was hoping for something that would work generically but still get some speed boost given the specific limitations I mentioned (like n always being guaranteed to be smaller than d, hence why I mentioned a "proper fraction" instead of just MulDiv or integer scaling, whereas other algorithms I've seen expect that n can be larger than d). – Taron Aug 01 '19 at 04:42
  • @Taron: Assuming `d` is unpredictable, the divide dominates the time spent on the whole operation. There's nothing clever you can do to get around that. The mul is essentially free on any modern system. – R.. GitHub STOP HELPING ICE Aug 01 '19 at 04:44
  • Actually just to correct myself, I don't even have any situation where d is constant, I just have the one I mentioned in the question where X is constant. – Taron Aug 01 '19 at 15:35
  • @Taron: Obviously that just lets you replace the mul by a shift, but it's unlikely to matter much anyway. The divide dominates and can't be eliminated. – R.. GitHub STOP HELPING ICE Aug 01 '19 at 16:00
1

I've now benchmarked several possible solutions, including weird/clever ones from other sources like combining 32-bit div & mod & add or using peasant math, and here are my conclusions:

First, if you are only targeting Windows and using VSC++, just use MulDiv(). It is quite fast (faster than directly using 64-bit variables in my tests) while still being just as accurate and rounding the result for you. I could not find any superior method to do this kind of thing on Windows with VSC++, even taking into account restrictions like unsigned-only and N <= D.

However, in my case having a function with deterministic results even across platforms is even more important than speed. On another platform I was using as a test, the 64-bit divide is much, much slower than the 32-bit one when using the 32-bit libraries, and there is no MulDiv() to use. The 64-bit divide on this platform takes ~26x as long as a 32-bit divide (yet the 64-bit multiply is just as fast as the 32-bit version...).

So if you have a case like me, I will share the best results I got, which turned out to be just optimizations of chux's answer.

Both of the methods I will share below make use of the following function (though the compiler-specific intrinsics only actually helped in speed with MSVC in Windows):

inline u32 bitsRequired(u32 val)
{
    #ifdef _MSC_VER
        DWORD r = 0;
        _BitScanReverse(&r, val | 1);
        return r+1;
    #elif defined(__GNUC__) || defined(__clang__)
        return 32 - __builtin_clz(val | 1);
    #else
        int r = 1;
        while (val >>= 1) ++r;
        return r;
    #endif
}

Now, if x is a constant that's 16-bit in size or smaller and you can pre-compute the bits required, I found the best results in speed and accuracy from this function:

u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
    //assert(nMaxBits == 32 - bitsRequired(x));
    //assert(n <= d);
    const int bitShift = bitsRequired(n) - nMaxBits;
    if( bitShift > 0 )
    {
        n >>= bitShift;
        d >>= bitShift;
    }

    // Remove the + d/2 part if don't need rounding
    return (x * n + d/2) / d;
}

On the platform with the slow 64-bit divide, the above function ran ~16.75x as fast as return ((u64)x * n + d/2) / d; and with an average 99.999981% accuracy (comparing difference in return value from expected to range of x, i.e. returning +/-1 from expected when x is 2048 would be 100 - (1/2048 * 100) = 99.95% accurate) when testing it with a million or so randomized inputs where roughly half of them would normally have been an overflow. Worst-case accuracy was 99.951172%.

For the general use case, I found the best results from the following (and without needing to restrict N <= D to boot!):

u32 scaleToFraction(u32 x, u32 n, u32 d)
{
    u32 bits = bitsRequired(x);
    int bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    int sh = bitShift;
    x >>= bitShift;

    bits = bitsRequired(n);
    bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh += bitShift;
    n >>= bitShift;

    bits = bitsRequired(d);
    bitShift = bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh -= bitShift;
    d >>= bitShift;

    // Remove the + d/2 part if don't need rounding
    u32 r = (x * n + d/2) / d;
    if( sh < 0 )
        r >>= (-sh);
    else //if( sh > 0 )
        r <<= sh;

    return r;
}

On the platform with the slow 64-bit divide, the above function ran ~18.5x as fast as using 64-bit variables and with 99.999426% average and 99.947479% worst-case accuracy.

I was able to get more speed or more accuracy by messing with the shifting, such as trying to not shift all the way down to 16-bit if it wasn't strictly necessary, but any increase in speed came at a high cost in accuracy and vice versa.

None of the other methods I tested came even close to the same speed or accuracy, most being slower than just using the 64-bit method or having huge loss in precision, so not worth going into.

Obviously, no guarantee that anyone else will get similar results on other platforms!

EDIT: Replaced some bit-twiddling hacks with plain code that actually ran faster anyway by letting the compiler do its job.

Taron
  • 191
  • 9
  • Does `u32 bits = bitsRequired(x); u32 bitShift = bits - 16; bitShift &= -(bitShift <= bits); ` work when `bits < 16`? (Be sure `bitShift` in is valid range. – chux - Reinstate Monica Aug 06 '19 at 02:38
  • 1
    @chux Yes it does. The ````bitShift &= -(bitShift <= bits);```` line makes bitShift become 0 when bits is < 16 (after bitShift underflows). I got that technique from http://locklessinc.com/articles/sat_arithmetic/ – Taron Aug 06 '19 at 02:42
  • 1) Try `return (((x * n + d/2) / (unint16_t) d) << lsh) >> rsh;` to see if the compiler takes advantage of a 32/16 divide. – chux - Reinstate Monica Aug 06 '19 at 04:05
  • 2) If not, then no need for reducing `d`.and `rsh` is always 0. – chux - Reinstate Monica Aug 06 '19 at 04:06
  • When I tried removing the reduction for d and the rsh as per your previous comment the accuracy dropped significantly for some reason. I haven't tried casting d to a u16 though, should that potentially improve performance? I'll give it a try tomorrow. – Taron Aug 06 '19 at 04:12
  • Since `n <= x`, could improve accuracy when `n <= 0x7FFF` by allowing `x` to be 17 bits, etc. I.e. When final `n_bitShift` is 0 , then more bits may be left in `x`. – chux - Reinstate Monica Aug 06 '19 at 04:21
  • @chux n is not guaranteed to be less than x, it is only less than d. X can be any value (unless you mean in the case of x being a constant). Regardless, I initially tried to only reduce n and x as needed so their combined bits were low enough for the product to fit in 32 bits, but while I gained a bit of accuracy I lost so much performance from the extra code required that it didn't seem worth it. Sometimes accuracy was worse too in cases of large differences in size of n and x that caused more logic needed to compensate. – Taron Aug 06 '19 at 04:30
  • IAC, the idea being when `n,x` is 15 bits or less, a few more than 16 bits may be retained in the other. Yes, agree, it is a trade-off of accuracy vs. time. I am surprised about [removing the reduction for d](https://stackoverflow.com/questions/57300788/fast-method-to-multiply-integer-by-proper-fraction-without-floats-or-overflow/57368183?noredirect=1#comment101223105_57368183) Hmmm TTFN – chux - Reinstate Monica Aug 06 '19 at 04:34
  • @chux I tried again to be sure, and when I comment out the reduction of ````d```` and ````rsh````, accuracy drops from 99.999303% to 92.735752% in the general case, with no change in average speed. Worse, in a version of that function that has ````x```` as a constant ````0x0800````, accuracy drops from 99.988572% to 53.121537% (though speed is much more noticeably improved)! I'm not good enough at math to understand why this is conceptually. Sadly, using ````(unint16_t) d```` made no difference, though I currently only have access to 2 of the 9 platforms this code base runs on to test that. – Taron Aug 06 '19 at 15:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/197584/discussion-between-chux-and-taron). – chux - Reinstate Monica Aug 06 '19 at 20:36