30

Is there some way to compute the reciprocal (1 divided by X) faster, if precision is not crucial?

So, I need to calculate 1/X. Is there some workaround so I lose precision but do it faster?

klm123
  • 12,105
  • 14
  • 57
  • 95
  • 3
    This is heavily dependent on the hardware platform you're working on. It also depends on how much precision you're prepared to lose. Obviously, `float recip(float x) { return 1; }` is very fast, but not very accurate... – Oliver Charlesworth Mar 30 '12 at 08:25
  • 10
    [Single-precision reciprocals run in 5 cycles on the latest processors. A floating-point multiplication is also 5 cycles.](http://www.agner.org/optimize/instruction_tables.pdf) So I seriously doubt you're gonna get any faster than something like `(float)1/(float)x`. – Mysticial Mar 30 '12 at 08:25
  • 3
    For starters, what is your platform and compiler? And what kind of data are you operating on? – Johan Kotlinski Mar 30 '12 at 13:01
  • 3
    @Mysticial Be careful: 5 cycles was the absolute best-case, lowest latency; the worst-case number is around 37 cycles. Remember the hardware implements an iterative root-seeking approximation algorithm, like Newton's method, until the accuracy is sufficient. – nimig18 Apr 06 '17 at 03:47

7 Answers

14

I believe what the asker was looking for is a genuinely more efficient way to approximate 1.0/x, not some technical definition of approximation under which 1 would count as a very imprecise answer. I also believe the following satisfies that.

#ifdef __cplusplus
    #include <cstdint>
#else
    #include <stdint.h>
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    union {
        double dbl;
        #ifdef __cplusplus
            std::uint_least64_t ull;
        #else
            uint_least64_t ull;
        #endif
    } u;
    u.dbl = x;
    u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> 1;
                                // pow( x, -0.5 )
    u.dbl *= u.dbl;             // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.dbl;
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
    union {
        float single;
        #ifdef __cplusplus
            std::uint_least32_t uint;
        #else
            uint_least32_t uint;
        #endif
    } u;
    u.single = x;
    u.uint = ( 0xbe6eb3beU - u.uint ) >> 1;
                                // pow( x, -0.5 )
    u.single *= u.single;       // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.single;
}

Hmm....... I wonder if the CPU manufacturers knew you could approximate the reciprocal with only a single multiply, a subtraction, and a bit-shift when they designed the CPU.... hmm.........

As for benchmarking: the hardware multiply and subtraction instructions this method uses are, in combination, about as fast as the hardware 1.0/x instruction on modern computers (my benchmarks were on an Intel i7, but I would assume similar results on other processors). However, if this algorithm were implemented in hardware as a new assembly instruction, the increase in speed would probably be good enough for that instruction to be quite practical.

For more information about this method: this implementation is based on the wonderful "fast inverse square root" algorithm.
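
For reference, that algorithm uses the same idea with a different magic constant and a Newton-Raphson step instead of a squaring. A sketch (written with std::memcpy for the type punning so the bit reinterpretation is well-defined; the function name is mine):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the classic "fast inverse square root": reinterpret the float's
// bits as an integer, apply a magic constant, then refine with Newton-Raphson.
inline float fast_rsqrt(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);        // read the bits of x
    i = 0x5f3759dfU - (i >> 1);           // magic initial guess for x^(-1/2)
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement
}
```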

As Pharap brought to my attention, reading an inactive member of a union is undefined behavior in C++, so here are two possible solutions I have devised from his helpful comments to avoid it. The first seems more like an unsavory trick to get around a language semantic, and is practically no better than the original solution.

#ifdef __cplusplus
    #include <cstdint>
#else
    #include <stdint.h>
#endif
__inline__ double __attribute__((const)) reciprocal( double x ) {
    union {
        double dbl[2];
        #ifdef __cplusplus
            std::uint_least64_t ull[2];
        #else
            uint_least64_t ull[2];
        #endif
    } u;
    u.dbl[0] = x; // dbl is now the active member, so only dbl can be read
    u.ull[1] = 0; // trick to make ull the active member so that ull can be read
    u.ull[0] = ( 0xbfcdd6a18f6a6f52ULL - u.ull[0] ) >> 1;
    u.dbl[1] = 0; // now make dbl the active member again so that it can be read
    u.dbl[0] *= u.dbl[0];
    return u.dbl[0];
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
    union {
        float flt[2];
        #ifdef __cplusplus
            std::uint_least32_t uint[2];
        #else
            uint_least32_t uint[2];
        #endif
    } u;
    u.flt[0] = x; // now flt is active
    u.uint[1] = 0; // set uint to be active for reading and writing
    u.uint[0] = ( 0xbe6eb3beU - u.uint[0] ) >> 1;
    u.flt[1] = 0; // set flt to be active for reading and writing
    u.flt[0] *= u.flt[0];
    return u.flt[0];
}

The second possible solution is much more palatable because it gets rid of unions altogether. However, it will be much slower if the compiler does not optimize it properly. On the upside, the solution below is completely agnostic of byte order, provided:

  1. that bytes are 8-bits in width
  2. that bytes are the smallest atomic unit on the target machine.
  3. that doubles are 8-bytes wide and floats are 4-bytes wide.

#ifdef __cplusplus
    #include <cstdint>
    #include <cstring>
    #define stdIntWithEightBits std::uint8_t
    #define stdIntSizeOfFloat std::uint32_t
    #define stdIntSizeOfDouble std::uint64_t
#else
    #include <stdint.h>
    #include <string.h>
    #define stdIntWithEightBits uint8_t
    #define stdIntSizeOfFloat uint32_t
    #define stdIntSizeOfDouble uint64_t
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    double byteIndexFloat = 1.1212798184631136e-308; // its bytes hold the bit offsets 00 08 10 18 20 28 30 38
    // C-style casts so the code also compiles as C (reinterpret_cast is C++-only);
    // accessing object bytes through an unsigned-char-sized pointer is permitted.
    stdIntWithEightBits* byteIndexs = (stdIntWithEightBits*)&byteIndexFloat;

    stdIntWithEightBits* inputBytes = (stdIntWithEightBits*)&x;

    // Cast each byte up to 64 bits before shifting: shifting a promoted int
    // by 32 or more bits would be undefined behavior.
    stdIntSizeOfDouble inputAsUll = (
        ((stdIntSizeOfDouble)inputBytes[0] << byteIndexs[0]) |
        ((stdIntSizeOfDouble)inputBytes[1] << byteIndexs[1]) |
        ((stdIntSizeOfDouble)inputBytes[2] << byteIndexs[2]) |
        ((stdIntSizeOfDouble)inputBytes[3] << byteIndexs[3]) |
        ((stdIntSizeOfDouble)inputBytes[4] << byteIndexs[4]) |
        ((stdIntSizeOfDouble)inputBytes[5] << byteIndexs[5]) |
        ((stdIntSizeOfDouble)inputBytes[6] << byteIndexs[6]) |
        ((stdIntSizeOfDouble)inputBytes[7] << byteIndexs[7])
    );
    inputAsUll = ( 0xbfcdd6a18f6a6f52ULL - inputAsUll ) >> 1;

    double outputDouble;

    // Explicit narrowing casts: without them, C++ list-initialization
    // rejects the implicit uint64 -> uint8 conversion.
    const stdIntWithEightBits outputBytes[] = {
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[0]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[1]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[2]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[3]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[4]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[5]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[6]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[7])
    };
    memcpy(&outputDouble, &outputBytes, 8);

    return outputDouble * outputDouble;
}

__inline__ float __attribute__((const)) reciprocal( float x ) {
    float byteIndexFloat = 7.40457e-40; // its bytes hold the bit offsets 00 08 10 18
    stdIntWithEightBits* byteIndexs = (stdIntWithEightBits*)&byteIndexFloat;

    stdIntWithEightBits* inputBytes = (stdIntWithEightBits*)&x;

    stdIntSizeOfFloat inputAsInt = (
        ((stdIntSizeOfFloat)inputBytes[0] << byteIndexs[0]) |
        ((stdIntSizeOfFloat)inputBytes[1] << byteIndexs[1]) |
        ((stdIntSizeOfFloat)inputBytes[2] << byteIndexs[2]) |
        ((stdIntSizeOfFloat)inputBytes[3] << byteIndexs[3])
    );
    inputAsInt = ( 0xbe6eb3beU - inputAsInt ) >> 1;

    float outputFloat;

    const stdIntWithEightBits outputBytes[] = {
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[0]),
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[1]),
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[2]),
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[3])
    };
    memcpy(&outputFloat, &outputBytes, 4);

    return outputFloat * outputFloat;
}
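
A third option worth mentioning (a sketch of my own, not from the answers above): std::memcpy is the standards-blessed way to type-pun in both C and C++, and mainstream compilers optimize a small fixed-size memcpy down to a plain register move, so it avoids the union question entirely without the byte-shuffling:

```cpp
#include <cstdint>
#include <cstring>

// The same bit trick as the union version, but with std::memcpy doing the
// type punning, which is well-defined in both C and C++.
inline double reciprocal_memcpy(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);         // read the bits of x
    bits = (0xbfcdd6a18f6a6f52ULL - bits) >> 1;  // approximate pow(x, -0.5)
    std::memcpy(&x, &bits, sizeof x);
    return x * x;                                // square it to get ~1/x
}
```

This does assume the integer and floating-point byte orders match, which holds on all mainstream platforms.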

Disclaimer: Lastly, please note that I am more of a novice in C++. As such, I welcome with wide open arms any best-practices, proper-formatting, or implication-clarity edits to the end of both improving the quality of this answer for all who read it and expanding my knowledge of C++ for all my years to come.

Jack G
  • 4,553
  • 2
  • 41
  • 50
  • 1
    You could explain the magic number, and what floating point representation it assumes. – Cheers and hth. - Alf Sep 27 '16 at 02:49
  • 1
    that is very interesting. Thank you! Do you have any results for comparison tests of precision and speed? – klm123 Sep 27 '16 at 05:11
  • 4
    Did you test this against x86's approximate-reciprocal instruction, [`RCPSS`](http://www.felixcloutier.com/x86/RCPSS.html) on your i7? It's as fast as an integer multiply, and doesn't require moving data from XMM registers to integer. You can use it from C++ with `_mm_rcp_ss( _mm_set_ss(x) )`. gcc and clang will convert `1.0/x` into RCPSS + a Newton-Raphson iteration, if you use -ffast-math, but I think you have to use intrinsics manually if you want the value without an approximation step. – Peter Cordes Nov 08 '16 at 14:40
  • Note that the `_mm_set_ss` isn't always free even when `x` is already live in an XMM register, because Intel's intrinsics suck for scalars. See http://stackoverflow.com/questions/39318496/how-to-merge-a-scalar-into-a-vector-without-the-compiler-wasting-an-instruction, and also http://stackoverflow.com/questions/40416570/issues-of-compiler-generated-assembly-for-intrinsics for silly compilers wasting instructions zeroing high elements when only the low element is going to be used. – Peter Cordes Nov 08 '16 at 14:46
  • @PeterCordes, how do you get the compilers to use RCPSS + Newton-Raphson? A quick test did not achieve this https://godbolt.org/g/3yAJGb. BTW, you stopped answering questions after 1.1.2017. What happened? – Z boson Feb 05 '17 at 13:44
  • @Zboson: I decided to take a break from SO for a bit and spend time catching up on some audiobooks, TV shows and stuff. And play some games / do other stuff on my new Skylake desktop. :) I expect I'll get back into answering SO questions more regularly again at some point, but I'm kind of an all-or-nothing person when it comes to what I spend my time on. My brother was home visiting over new years, and that got me out of all-SO all-the-time mode. – Peter Cordes Feb 06 '17 at 15:46
  • @Zboson: Hmm, gcc uses RCPPS on a `__m128` automatically in cases where it won't use RCPSS on a `float`. e.g. https://godbolt.org/g/RsbUHz. That's odd. Perhaps my previous comment is an over-simplification. I *though* I'd seen some compilers use RCPSS automatically, but apparently they don't always. Or maybe don't ever? I think they probably do in some circumstances, but I don't know which ones. (Also, gcc in that godbolt link fails to use FMA for the Newton iteration after RCPPS :(. Clang does, though.) – Peter Cordes Feb 06 '17 at 15:56
  • @PeterCordes did you know that you can have multiple compilers windows in godbolt? I discovered that today. Here is GCC and Clang although Clang does not compile. I don't know what you did to get Clang to compile (without using intrinsics) https://godbolt.org/g/7ahBSt – Z boson Feb 06 '17 at 19:26
  • 1
    Using a `union` like this is **undefined behaviour**. – Pharap Jun 09 '19 at 21:30
  • 1
    @Pharap Could you please elaborate on this case of UB? I am rather inexperienced with C++ and struggle immensely with getting rid of UB. I would relish an explanation and/or an online resource to learn more. Thank you so much for bringing this issue to my attention. – Jack G Jun 10 '19 at 03:26
  • 1
    @JackGiffin When using a `union`, only the 'active' member can be read from. Reading from an 'inactive' member is undefined behaviour. A member becomes 'active' when it is assigned to, at which point the member that was previously 'active' becomes 'inactive' and can no longer legally be read from. Many compilers allow reading from inactive members as an extension, but by the terms of the C++ standard it's undefined behaviour, thus it could legally do anything - it could segfault, demons could fly out of your nose, et cetera. See also: https://en.cppreference.com/w/cpp/language/union#Explanation – Pharap Jun 10 '19 at 20:12
  • 1
    @JackGiffin The only solution to this problem that I can think of that wouldn't invoke undefined behaviour is to take the address of the `double` (creating a `double *`), converting it to an `unsigned char *` and operating on that. It would have to be `char *` or `unsigned char *`, you couldn't convert to `unsigned *` for reasons described [here](https://stackoverflow.com/a/13996008). (I can't find the relevant cppreference page at the moment, but there is a page that explains which pointer conversions are allowed.) – Pharap Jun 10 '19 at 20:24
  • 1
    @Pharap I have done my best with my limited knowledge of C++ to cover the issue of undefined behavior in the program with the helpful information you provided. Thank you so much for helping me with your insights. If convenient, could you please check over my new code to ensure that I did not completely botch it up. Thank you. – Jack G Jun 11 '19 at 00:09
  • 1
    @JackG The "that's undefined behavior!" crowd take their concern a bit too far. There are two concerns (a) compiler over-optimizations (b) exceptions on some processors (e.g. reading an invalid floating-point format which causes an exception when certain FPU flags are set). For (a) `std::launder` acts as an optimization barrier that prevents the optimizer from performing constant propagation. For (b) which applies to reinterpret_cast too, it can't be addressed unilaterally because it depends on the CPU you run on, but then, presumably you know your target platforms. – Dwayne Robinson Feb 28 '22 at 21:20
    If I were a CPU designer, I would not waste an instruction only for calculating an approximation. CPUs are generally made for broad usage, and most usages require the exact value. CPU designers count on you to do these tricks if you want speed :) – Kapandaria May 12 '22 at 20:48
  • 1
    Technically, using union this way is undefined behaviour in C++, but most compilers including GCC, Clang, and MSVC allow this use as an extension. See the part of `-fstrict-aliasing` at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html – xiver77 Jun 13 '22 at 04:25
5

First, make sure this isn't a case of premature optimization. Do you know that this is your bottleneck?

As Mysticial says, 1/x can be calculated very quickly. Make sure you're not using the double datatype for either the 1 or the divisor. Floats are much faster.

That said, benchmark, benchmark, benchmark. Don't waste your time spending hours on numerical theory just to discover the source of the poor performance is IO access.

Mahmoud Al-Qudsi
  • 28,357
  • 12
  • 85
  • 125
  • 5
    "Floats are much faster" - really? It's dangerous to make such sweeping statements. There are a lot of things you can do to change the code the compiler generates. It also depends on the hardware the compiler is aiming for. For example, on the IA32, the code generated by gcc when not using SSE (the -mfpmath=387 option I think) will be the same speed for double and float since the FPU only deals with 80bit values, any speed difference will be down to memory bandwidth. – Skizz Mar 30 '12 at 08:41
  • 1
    Yes, obviously it's a blanket statement. But the question was equally generic. Have the OP give specifics, and I'd be in a position to make a more "eloquent" response. – Mahmoud Al-Qudsi Mar 30 '12 at 08:42
  • 2
    1/x can be calculated quickly .. but how do you make a compiler actually emit RCPSS? – harold Mar 30 '12 at 09:06
  • 1
    Mysticial was being 'mystical' in that statement: he quoted the lower bound of 5 clks, which was actually the reciprocal throughput and should have been 6 clk cycles in the **best case**; the other end, the **worst case**, was around 38 clks. {6-21,7-21,10-24,10-15,14-16,11-25,12-26,10-24,9-38,9-37,19,22,19,38,6-38} In these cases the average clocks per instruction (CPI) must be used to compare apples and oranges. – nimig18 Apr 06 '17 at 04:09
  • 2
    Downvoted because most of this answer didn't actually answer anything, and the tiny part that did didn't even try. – Veedrac Oct 20 '17 at 13:30
  • 2
    Seems that it is off topic - OP was asking for method to calculate fast reciprocals for floats not for optimization tutorial – Jacek Blaszczynski Aug 12 '18 at 01:31
  • Feels like Clippy sometimes: "It looks like you are trying to prematurely optimize ..." – LegendLength Jul 12 '22 at 10:25
3

I’ve bench-tested these methods on an Arduino NANO for speed and 'accuracy'.
The basic calculation was to set up variables Y = 15,000,000 and Z = 65,535
(in my real case, Y is a constant and Z can vary between 65353 and 3000, so a useful test).
Calc time on the Arduino was established by pulling a pin low, then high as the calc was made, then low again, and comparing the on-times with a logic analyser, over 100 cycles. With the variables as unsigned integers:

Y * Z takes 0.231 msec
Y / Z takes 3.867 msec

With the variables as floats:

Y * Z takes 1.066 msec
Y / Z takes 4.113 msec

Reference result: 15,000,000 / 65,535 = 228.885 (via calculator).

Using Jack G’s float reciprocal algorithm:

Y * reciprocal(Z) takes 1.937 msec, which is a good improvement, but the accuracy less so: 213.68.

Using nimig18’s float inv_fast:

Y * inv_fast(Z) takes 5.501 msec, accuracy 228.116, with a single iteration
Y * inv_fast(Z) takes 7.895 msec, accuracy 228.883, with a second iteration

Using Wikipedia's Q_rsqrt (pointed to by Jack G):

Y * Q_rsqrt(Z) takes 6.104 msec, accuracy 228.116, with a single iteration

All entertaining, but ultimately disappointing!
NT4Boy
  • 33
  • 4
3

First of all, if you turn on compiler optimizations, the compiler is likely to optimize the calculation where possible (pulling it out of a loop, for example). To see this optimization, you need to build and run in release mode.

Division may be heavier than multiplication (though a commenter pointed out that reciprocals are just as fast as multiplication on modern CPUs, in which case this isn't correct for your case), so if you do have 1/X appearing somewhere inside a loop (and more than once), you can help by caching the result before the loop (float Y = 1.0f/X;) and then using Y inside it. (The compiler optimization might do this in any case.)
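
The caching idea looks like this in practice (a sketch; the function and names are made up):

```cpp
#include <cstddef>

// Hoist the division out of the loop and multiply instead:
// one divide up front replaces n divides inside the loop.
void scale_all(float* out, const float* in, std::size_t n, float x) {
    const float y = 1.0f / x;   // cached reciprocal, computed once
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * y;     // multiply is cheaper than divide
}
```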

Also, certain formulas can be redesigned to remove division or other inefficient computations. For that, you could post the larger computation being performed. Even then, the program or algorithm itself can sometimes be restructured so that time-consuming loops are hit less frequently.

How much accuracy can be sacrificed? If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.
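
One reading of the bitwise approach (a sketch of my own; valid only for normal, positive floats of moderate magnitude, and the function name is made up): extract the binary exponent and negate it, which gives 1/x to within a factor of two.

```cpp
#include <cstdint>
#include <cstring>

// Power-of-two approximation of 1/x for normal positive floats:
// take the biased exponent e of x and build a float with exponent 253 - e,
// i.e. 2^(-k-1) where x = 2^k * m with m in [1, 2). The result is always
// within a factor of two of the true reciprocal.
inline float reciprocal_order(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    std::uint32_t e = (bits >> 23) & 0xFFu;  // biased exponent of x
    std::uint32_t r = (253u - e) << 23;      // biased exponent of ~1/x
    std::memcpy(&x, &r, sizeof x);
    return x;
}
```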

However, in general, there's no way to speed up division. If there were, compilers would already be doing it.

Dan Nissenbaum
  • 13,558
  • 21
  • 105
  • 181
  • *If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.* How? – klm123 Mar 30 '12 at 08:39
  • 1
    I did not mean to imply "trivial". Also, I should have added the caveat that X >> 1 (see end of comment). In that case, you can take advantage of X^-1 = 2^(-log_2(X)), and use http://en.wikipedia.org/wiki/Find_first_set#Algorithms to get the order of magnitude of log_2(X), to obtain the order of magnitude in the form 2^-n. If upper and lower bounds on X is known, this could be used to improve speed. If the other quantities in the calculation (not shown in the question) have known bounds and are somewhat commensurate in order of magnitude, they can be scaled and cast to integers. – Dan Nissenbaum Mar 30 '12 at 09:20
  • 1
    Compilers can only hoist Y=1.0f/X out of a loop if you use `-ffast-math`, so it is a good idea to do that in the source if you don't plan to enable `-ffast-math` to tell the compiler that you don't care where/when/how rounding happens, or in what order. – Peter Cordes Nov 08 '16 at 15:00
1

This should do it, with a number of pre-unrolled Newton iterations evaluated as a Horner polynomial using fused multiply-accumulate operations, which most modern CPUs execute in a single clock cycle (every time):

#include <stdint.h>

float inv_fast(float x) {
    union { float f; int i; } v;
    float w, sx;

    sx = (x < 0) ? -1:1;
    x = sx * x;

    v.i = (int)(0x7EF127EA - *(uint32_t *)&x);
    w = x * v.f;

    // Efficient Iterative Approximation Improvement in horner polynomial form.
    v.f = v.f * (2 - w);     // Single iteration, Err = -3.36e-3 * 2^(-flr(log2(x)))
    // v.f = v.f * ( 4 + w * (-6 + w * (4 - w)));  // Second iteration, Err = -1.13e-5 * 2^(-flr(log2(x)))
    // v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w *(56 + w * (-28 + w * (8 - w)))))));  // Third Iteration, Err = +-6.8e-8 *  2^(-flr(log2(x)))

    return v.f * sx;
}

Fine print: closer to 0 the approximation does not do so well, so either you, the programmer, need to test the performance, or restrict the input from getting too low before resorting to hardware division. I.e., be responsible!

user703016
  • 37,307
  • 8
  • 87
  • 112
nimig18
  • 797
  • 7
  • 10
  • can you explain the meaning of the bit pattern 0x7EF127EA ? I guess it's to normalize x to ~ 1.0, but how to achieve that? – minhhn2910 Dec 17 '18 at 07:51
0

The fastest way that I know of is to use SIMD operations. http://msdn.microsoft.com/en-us/library/796k1tty(v=vs.90).aspx
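
For illustration (a sketch assuming an x86 CPU with SSE; the function name is made up), the RCPPS instruction approximates four reciprocals at once, each within a relative error of about 1.5 * 2^-12:

```cpp
#include <immintrin.h>

// Approximate reciprocals of four floats at once with RCPPS (SSE).
void recip4(float out[4], const float in[4]) {
    __m128 v = _mm_loadu_ps(in);        // load 4 floats (unaligned ok)
    _mm_storeu_ps(out, _mm_rcp_ps(v));  // ~12-bit-accurate 1/x for each lane
}
```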

cdiggins
  • 17,602
  • 7
  • 105
  • 102
  • 1
    Or to buy a faster CPU? :) The question is algorithmic. – klm123 Jan 16 '14 at 12:16
  • 3
    RCPSS/ [RCPPS](http://www.felixcloutier.com/x86/RCPPS.html) is a good suggestion. Fast approximate inverse (and inverse sqrt) are available in hardware on x86, for scalar or a SIMD vector of floats. You don't have to be using SIMD for the rest of your loop to take advantage. If this answer had explained any of that, it wouldn't have gotten such confused comments. – Peter Cordes Nov 08 '16 at 15:03
0

The rcpss assembly instruction computes an approximate reciprocal with |Relative Error| ≤ 1.5 ∗ 2^−12.

You can enable it on a compiler with the -mrecip flag (you might also need -ffast-math).

The intrinsic is _mm_rcp_ss.
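
Putting those pieces together (a sketch, assuming x86 with SSE; the helper name is made up): the ~12-bit RCPSS estimate can be refined with one Newton-Raphson step, r' = r * (2 - x*r), which roughly doubles the number of correct bits.

```cpp
#include <immintrin.h>

// RCPSS gives a ~12-bit estimate of 1/x; one Newton-Raphson step
// refines it to roughly single-precision accuracy.
inline float rcp_nr(float x) {
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));  // rough 1/x
    return r * (2.0f - x * r);                           // one NR refinement
}
```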

Richard
  • 56,349
  • 34
  • 180
  • 251