174

When reading Lua’s source code, I noticed that Lua uses a macro to round double values to 32-bit int values. The macro is defined in the llimits.h header file and reads as follows:

union i_cast {double d; int i[2];};
#define double2int(i, d, t) \
    {volatile union i_cast u; u.d = (d) + 6755399441055744.0; \
    (i) = (t)u.i[ENDIANLOC];}

Here ENDIANLOC is defined according to endianness: 0 for little endian, 1 for big endian architectures; Lua carefully handles endianness. The t argument is substituted with an integer type like int or unsigned int.

I did a little research and found a simpler form of that macro which uses the same technique:

#define double2int(i, d) \
    {double t = ((d) + 6755399441055744.0); i = *((int *)(&t));}

Or, in C++ style:

inline int double2int(double d)
{
    d += 6755399441055744.0;
    return reinterpret_cast<int&>(d);
}

This trick can work on any machine using IEEE 754 (which means pretty much every machine today). It works for both positive and negative numbers, and the rounding follows banker’s rounding (round half to even). (This is not surprising, since it is IEEE 754’s default rounding mode.)

I wrote a little program to test it:

#include <stdio.h>

/* uses the two-argument double2int macro from above */
int main()
{
    double d = -12345678.9;
    int i;
    double2int(i, d)
    printf("%d\n", i);
    return 0;
}

And it outputs -12345679, as expected.

I would like to understand how this tricky macro works in detail. The magic number 6755399441055744.0 is actually 2^51 + 2^52, or 1.5 × 2^52, and 1.5 in binary can be represented as 1.1. When any 32-bit integer is added to this magic number—

Well, I’m lost from here. How does this trick work?

Update

  1. As @Mysticial points out, this method does not limit itself to a 32-bit int; it can also be expanded to a 64-bit int as long as the number is in the range of 2^52. (Although the macro needs some modification.)

  2. Some materials say this method cannot be used in Direct3D.

  3. When working with Microsoft assembler for x86, there is an even faster macro written in assembly code (the following is also extracted from Lua source):

     #define double2int(i,n)  __asm {__asm fld n   __asm fistp i}
    
  4. There is a similar magic number for single-precision numbers: 1.5 × 2^23.

Yu Hao
  • "fast" compared to what? – Cory Nelson Jun 11 '13 at 02:18
  • @CoryNelson Fast compared to a simple cast. This method, when implemented properly (with SSE intrinsics), is quite literally a hundred times faster than a cast (which invokes a nasty function call to some rather expensive conversion code). – Mysticial Jun 11 '13 at 02:22
  • Right -- I can see it being faster than `ftoi`. But if you're talking SSE, why not just use the single instruction `CVTTSD2SI`? – Cory Nelson Jun 11 '13 at 02:38
  • @CoryNelson For `double -> int32` conversions, that instruction is fine. But for `double -> int64`, this trick becomes king. – Mysticial Jun 11 '13 at 02:52
  • @Mysticial: It doesn't quite seem to work for `double -> int64`; the addition maps 2^52 and 2^52 + 1 to the same number. (I guess it's still good for rounding things that aren't in that troublesome range, though. 50-bit numbers or something.) – tmyklebu Jun 11 '13 at 03:37
  • @tmyklebu Many of the use cases that go `double -> int64` are indeed within the `2^52` range. These are particularly common when performing integer convolutions using floating-point FFTs. – Mysticial Jun 11 '13 at 03:55
  • Related: http://stackoverflow.com/questions/429632/how-to-speed-up-floating-point-to-integer-number-conversion. You can also find explanation + benchmarks [here](http://stereopsis.com/sree/fpu2006.html). – Oak Jun 11 '13 at 12:45
  • @Mysticial: This method, when implemented properly should be just as fast as a cast, when implemented properly. Obviously your implementation poorly implements this particular cast. – MSalters Jun 11 '13 at 14:28
  • @MSalters Not necessarily true. A cast must live up to the language's specification - including proper handling of overflow and NaN cases (or whatever the compiler specifies in the case of IB or UB). These checks tend to be very expensive. The trick mentioned in this question completely ignores such corner cases. So if you want the speed and your application doesn't care about (or never encounters) such corner cases, then this hack is perfectly appropriate. – Mysticial Jun 11 '13 at 15:15
  • This same technique is explained in this blog post, except that it is for single precision instead of double precision, and with the additional simplification that the number to convert is assumed positive: http://blog.frama-c.com/index.php?post/2013/05/04/nearbyintf3 – Pascal Cuoq Jul 11 '13 at 17:52
  • If the double is 1e-100 then it will return 0 which is right. If the double is 1e100 I think you will get the wrong answer. – brian beuning Jul 28 '14 at 12:52
  • FWIW, a similar question about efficient `double <-> int64` conversions [recently came up](http://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx). – Mysticial Jan 10 '17 at 20:11

4 Answers

168

A value of the double floating-point type is represented like so:

[figure: IEEE 754 double layout: 1 sign bit, 11 exponent bits, 52 mantissa bits]

and it can be seen as two 32-bit integers; now, the int taken in all the versions of your code (supposing it’s a 32-bit int) is the one on the right in the figure, so what you are doing in the end is just taking the lowest 32 bits of mantissa.


Now, to the magic number; as you correctly stated, 6755399441055744 is 2^51 + 2^52; adding such a number forces the double to go into the “sweet range” between 2^52 and 2^53, which, as explained by Wikipedia, has an interesting property:

Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992, the representable numbers are exactly the integers.

This follows from the fact that the mantissa is 52 bits wide.

The other interesting fact about adding 2^51 + 2^52 is that it affects the mantissa only in the two highest bits—which are discarded anyway, since we are taking only its lowest 32 bits.


Last but not least: the sign.

IEEE 754 floating point uses a magnitude and sign representation, while integers on “normal” machines use 2’s complement arithmetic; how is this handled here?

We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so with absolute value less than 2^31; call it −a. Such a number is obviously made positive by adding the magic number, and the resulting value is 2^52 + 2^51 + (−a).

Now, what do we get if we interpret the mantissa in 2’s complement representation? It must be the result of the 2’s complement sum of (2^52 + 2^51) and (−a). Again, the first term affects only the upper two bits; what remains in bits 0–50 is the 2’s complement representation of (−a) (again, minus the upper two bits).

Since reduction of a 2’s complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits gives us correctly (−a) in 32-bit, 2’s complement arithmetic.

Matteo Italia
  • """The other interesting fact about adding 2^51+2^52 is that it affects the mantissa only in the two highest bits - which are discarded anyway, since we are taking only its lowest 32 bits""" What is that ? Adding this may shift all the mantissa ! – YvesgereY Jun 11 '13 at 18:40
  • @John: of course, the whole point of adding them is to force the value to be in that range, which obviously can result in shifting the mantissa (among other things) with respect to the original value. What I was saying here is that, once you are in that range, the only bits that differ from the corresponding 53-bit integer are bits 51 and 52, which are discarded anyway. – Matteo Italia Jun 11 '13 at 21:37
  • For those who'd like to convert to `int64_t`, you can do that by shifting the mantissa left and then right by 13 bits. This will clear the exponent and the two bits from the 'magic' number, but will keep and propagate the sign to the entire 64-bit signed integer. `union { double d; int64_t l; } magic; magic.d = input + 6755399441055744.0; magic.l <<= 13; magic.l >>= 13;` – Wojciech Migda May 13 '16 at 23:46
  • Do I understand correctly that 2^51 is only needed to handle negative values? – Kentzo Jul 21 '20 at 21:09
5

This kind of "trick" comes from older x86 processors, using the 8087 instructions/interface for floating point. On these machines, there's an instruction for converting floating point to integer, "fist", but it uses the current FP rounding mode. Unfortunately, the C spec requires that fp->int conversions truncate toward zero, while all other FP operations round to nearest, so doing an fp->int conversion requires first changing the FP rounding mode, then doing a fist, then restoring the FP rounding mode.

Now on the original 8086/8087, this wasn't too bad, but on later processors that started to get super-scalar and out-of-order execution, altering the fp rounding mode generally serializes the CPU core and is quite expensive. So on a CPU like a Pentium-III or Pentium-IV, this overall cost is quite high -- a normal fp->int conversion is 10x or more expensive than this add+store+load trick.

On x86-64, however, floating point is done with the xmm instructions, and the cost of converting fp->int is pretty small, so this "optimization" is likely slower than a normal conversion.

Chris Dodd
  • Specifically, x86-64 makes it cheap because SSE / SSE2 have convert-with-truncation instructions, like [`cvtTsd2si`](https://www.felixcloutier.com/x86/cvttsd2si) as well as convert with current rounding mode [`cvtsd2si`](https://www.felixcloutier.com/x86/cvtsd2si). SSE3 even added x87 `fisttp` for truncating conversion in legacy x87 registers, useful for 32-bit code that's using 32-bit calling conventions which pass float/double in x87. And/or for functions using x87 80-bit for extra precision. – Peter Cordes Sep 18 '21 at 07:58
0

If this helps with visualization, that Lua magic value (2^52 + 2^51) is, in base 2, 110 followed by 50 zeros.

In hex:

  0x 0018 0000 0000 0000  (0x18 followed by 12 zeros)

In octal:

  0 300 00000 00000 00000  (3 followed by 17 zeros)
RARE Kpop Manifesto
-1

Here is a simpler implementation of the above Lua trick:

/**
 * Round to the nearest integer.
 * for tie-breaks: round half to even (bankers' rounding)
 * Only works for inputs in the range: [-2^51, 2^51]
 */
inline double rint(double d)
{
    double x = 6755399441055744.0;  // 2^51 + 2^52
    return d + x - x;
}

The trick works for numbers with absolute value < 2^51.

This is a little program to test it (runnable on ideone.com):

#include <cstdio>

int main()
{
    // round to nearest integer
    printf("%.1f, %.1f\n", rint(-12345678.3), rint(-12345678.9));

    // test tie-breaking rule
    printf("%.1f, %.1f, %.1f, %.1f\n", rint(-24.5), rint(-23.5), rint(23.5), rint(24.5));      
    return 0;
}

// output:
// -12345678.0, -12345679.0
// -24.0, -24.0, 24.0, 24.0
Amr Ali
  • The point of the trick described in the OP is that it is fast. Or, at least, that 20 years ago it was fast. When Pentium IVs were still common, using type punning to get an integer out of a double was a huge win (see the benchmarks [here](http://stereopsis.com/sree/fpu2006.html), copied from a comment to the OP). Your version is, of course, simpler. But it doesn't solve the problem of getting that result into an integer variable without the slow cast. (There's also nothing C++-specific about your code.) – rici Aug 15 '20 at 19:12
  • I find the trick useful to round floating-point numbers, and not to cast to integers. – Amr Ali Aug 18 '20 at 07:46