Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

Question

I have made a function g that is able to approximate a function to a certain degree, this function gives accurate results up to 5 decimals ( 1,23456xxxxxxxxxxxx where the x positions are just rounding errors / junk ) .

To avoid spreading error to other computations that will use the results of g I would like to just set all the x positions to zero, better yet, just set to 0 everything after the 5th decimal .

I haven't found anything in X87 and SSE literature that let's me play with IEEE 754 bits or their representation the way I would like to .

There is an old reference to the FISTP instruction for X87 that is apparently mirrored in the SSE world with FISTTP, with the benefit that FISTTP doesn't necesserily modify the control word and is therefore faster .

I have noticed that FISTTP was called "chopping mode", but now in more modern literature is just "rounding toward zero" or "truncate" and this confuse me because "chopping" means removing something altogether where "rounding toward zero" doesn't necessarily means the same thing to me .

I don't need to round to zero, I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory; how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?

Wait... I'm confused. You mention IEEE and bits which means you know about FP representation. But at the same time you're asking how to truncate *decimals* which is impossible in binary floating-point. Am I missing something? — Mysticial, Jul 20 '16 at 18:09
@user354900 I disagree, work to the best precision you can, and truncate or round for display. By truncating during the calculations, you will "spread" a bigger error, not less. — Weather Vane, Jul 20 '16 at 18:11
@Mysticial I'm confused too, it all depends on what "chopping mode" means for Intel ... — user354900, Jul 20 '16 at 18:12
@WeatherVane I really don't need those extra decimals and there is no rounding mode that is helping my case / function — user354900, Jul 20 '16 at 18:13
@user354900 "Chopping mode" in IEEE floating point is just truncating the binary representation. In no way does it map to decimal digits unless you happen to be rounding to an integer. — Mysticial, Jul 20 '16 at 18:15
Floating point values (typically) do not *have* "decimal values" - that's a conversion to human readable format. If you don't need them, then you are not doing any further calculations with the values. So you might just as well retain the precision and deal with truncation at output. If you are doing more calculations, then you do need maximum precison. — Weather Vane, Jul 20 '16 at 18:15
@Mysticial Can you code an example for "rouding to an integer with chopping mode " I would like to see that in action — user354900, Jul 20 '16 at 18:16
@user354900: Weather Vane is right. By truncating you lose information, and it makes your results less accurate, i.e. you increase the error. — Rudy Velthuis, Jul 20 '16 at 18:19
@user354900 If you're really asking that question. Then I presume you don't know floating-point math. So I refer you to this: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html — Mysticial, Jul 20 '16 at 18:20
@Mysticial still confused about this, and apparently `FISTTP` doesn't really has an intrinsics equivalent . — user354900, Jul 20 '16 at 18:25
@user354900 Read that article that I linked you and you will understand why your question is fundamentally flawed. (putting aside the fact that this is an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)) When people use words like, "IEEE", "intrinsics", "SSE", "vectorization", and assembly instructions, we habitually assume they know all the basics such as floating-point representation. This is the first time I've seen otherwise. Sorry. — Mysticial, Jul 20 '16 at 18:38
internal calculations should be done in as most precision as possible, that's why x87 uses 80-bit registers — phuclv, Jul 21 '16 at 05:22

Peter Cordes · Answer 1 · 2016-07-21T03:58:07.823

As several people commented, more early rounding doesn't help the final result be more accurate. If you want to read more about floating point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the one with the index

We’ve finally reached the point in this series that I’ve been waiting for. In this post I am going to share the most crucial piece of floating-point math knowledge that I have. Here it is:

[Floating-point] math is hard.

You just won’t believe how vastly, hugely, mind-bogglingly hard it is. I mean, you may think it’s difficult to calculate when trains from Chicago and Los Angeles will collide, but that’s just peanuts to floating-point math.

(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)

How you could actually implement your bad idea:

There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.

Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.

So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down. (like chux's round()-based function, but I'd recommend C99 double rint(double) instead of round(). round has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code.

The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.

FISTP (Float Integer STore (and Pop the x87 stack) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (default: round to nearest), which is the same semantics as those ISO C standard functions.

FISTTP (Float Integer STore with Truncation (and Pop the x87 stack) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.

This is what the "not modify the control word" stuff is about. Neither instruction does that, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.

Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.

PS. if you didn't recognize Bruce's HHGTG reference, the original is:

Space is big. Really big. You just won’t believe how vastly hugely mindbogglingly big it is. I mean you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.

Could use `double HHGTG_round(double) { return 42; }` – chux - Reinstate Monica Jul 21 '16 at 02:51 — chux - Reinstate Monica, Jul 21 '16 at 02:51

chux - Reinstate Monica · Answer 2 · 2016-07-21T02:45:46.843

2

how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?

The following does not use X87 nor SSE. I've included it as a community reference for general purpose code. If anything, it can be used to test a X87 solution.

Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerable as OP said "To avoid spreading error to other computations ..."
It is unclear if OP wants "accurate results up to 5 decimals" to reflect absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). Will assume "absolute precision".
Since float, double are far often a binary floating point, any "chop" will reflect a FP number nearest to a multiple of 0.00001.

Text Method:

//       - x    xxx...xxx    . xxxxx \0
char buf[1+1+ DBL_MAX_10_EXP+1  +5   +1];
sprintf(buf, "%.5f", x);
x = atof(buf);

~~round()~~ rint() method:

#define SCALE 100000.0
if (fabs(x) < DBL_MAX/SCALE) {
  x = x*SCALE;
  x = rint(x)/SCALE;
}

Direct bit manipulation of x. Simply zero select bits in the significand.
```
TBD code.
```

edited Jul 21 '16 at 02:45

answered Jul 20 '16 at 20:50

chux - Reinstate Monica

143,097
13
135
256

2

I'd recommend [C99 `double rint(double)`](http://en.cppreference.com/w/c/numeric/math/rint) instead of `round()`. `round` has weird rounding semantics that don't match any of the available rounding modes on x86, [so it compiles to worse code for that architecture.](http://stackoverflow.com/a/37624488/224132). `round` can actually inline to a single instruction for ARM, since it seems ARM has an instruction with those semantics. – Peter Cordes Jul 21 '16 at 02:38
@Peter Cordes Good thoughts in your [comment](http://stackoverflow.com/questions/38487718/truncate-floats-and-doubles-after-user-defined-points-in-x87-and-sse-fpus/38490616?noredirect=1#comment64388705_38490616). I see both `rint()` and `nearbyint()` as contenders. Thanks for the tip. Amend the post as you see fit - it is community wiki. – chux - Reinstate Monica Jul 21 '16 at 02:44
1

oh right, yes [`double nearbyint(double)`](http://en.cppreference.com/w/c/numeric/math/nearbyint) should always be at least as efficient, and potentially better since it doesn't have to set errno. Other than that, I think they have identical semantics. – Peter Cordes Jul 21 '16 at 02:49

Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

2 Answers2

How you could actually implement your bad idea: