
I have inherited a pretty interesting piece of code:

inline int round(float a)
{
  int i;
  __asm {
    fld   a
    fistp i
  }
  return i;
}

My first impulse was to discard it and replace calls with (int)std::round (pre-C++11, would use std::lround if it happened today), but after a while I started to wonder if it might have some merit after all...


All values this function is ever called with lie in [-100, 100], so even int8_t would be wide enough to hold the result. fistp requires at least a 32 bit memory variable, however, so anything narrower than int32_t is just as wasteful as anything wider.

Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards. C++11 offers the std::lround function, which alleviates this particular issue, but still does seem to be more wasteful, considering that the value passes float->long->int instead of directly arriving where it should.

On the other hand, with inline-ASM in the function, the compiler cannot optimise away i into a register (and even if it could, fistp expects a memory variable), so std::lround does not seem too much worse...

The most pressing question I have is however how safe it is to assume (as this function does), that the rounding mode will always be round-to-nearest, as it obviously does (no checks). As std::lround has to guarantee a certain behaviour independent of rounding mode, this assumption, as long as it holds, always seems to make the inline-ASM round the better option.

It is furthermore highly unclear to me whether the rounding mode set by std::fesetround and used by the std::lround alternative std::lrint and the rounding mode employed in the fistp ASM-instruction are guaranteed to be the same or at least synchronous.


These are my considerations, aka what I do not know to make an informed decision on retaining or replacing the function.

Now to the questions:


Following a more informed view of these considerations or such which I have not thought of, does it seem advisable to use this function?

How great is the risk, if any?

Does reasoning exist for why it would not be faster than std::lround or std::lrint?

Can it be further improved without performance cost?

Does any of this reasoning change if the program were compiled for x86-64?

Zsar
  • Let me toss a few other considerations out there: Portability to other compilers is going to be an issue, since they tend to handle inline asm differently. Also, portability to other platforms is an issue. You may not plan to move this code to ARM anytime soon, but even porting to x64 presents problems (ie VS x64 doesn't allow inline asm). There is also the question of maintainability: will future maintainers know what this asm does? Code that people don't understand/are afraid to touch is a cost. IMO, if you can't define a REALLY clear benefit from inline asm, you're just showing off. FWIW. – David Wohlferd Jun 03 '16 at 21:47
  • @DavidWohlferd: It's not just showing off, it's shooting yourself in the foot and defeating the optimizer by introducing an opaque black-box function. Anyway, there's already a [canonical `float`->`int` Q&A](http://stackoverflow.com/questions/2035959/how-to-convert-a-float-to-an-int-by-rounding-to-the-nearest-whole-integer) so I focussed my answer here on the asm-specific part of this question. – Peter Cordes Jun 03 '16 at 22:50
  • @PeterCordes : Whoa, that link cannot possibly be considered canonical. It is thoroughly awful and the only useful information can be found in _two comments_ to a stub answer, accepted or not. Nay, one cannot direct anyone there in good conscience. Also not even a single C-family tag. – Zsar Jun 03 '16 at 23:45
  • @Zsar: I agree the answers there are all terrible, but the question doesn't have any x86 asm baggage. I'll probably write an answer there once I'm finished with this question. Or maybe a new Q&A is appropriate. It's harder than you'd expect to express it in C that can auto-vectorize. (I assume you actually have a whole array of floats, right? [`CVTPS2DQ`](http://www.felixcloutier.com/x86/CVTPS2DQ.html) converts 4 at once (or 8 with AVX), and you can pack 4 results down to 8bit elements to save store bandwidth). – Peter Cordes Jun 03 '16 at 23:53
  • @PeterCordes : Sadly no such array. Single `floats` undergo iterative operations wherein fractions have to accumulate, but input and eventual output have to be integrals. This happens far and wide in the code, with many different semantics to it. It's organically grown game code, oldest parts of which seem to predate the first C++ standard. I find the oddest things in there, hrhr, but this one seemed at least _possibly_ still relevant. – Zsar Jun 04 '16 at 00:04

1 Answer


TL;DR: use lrintf(x) or (int)nearbyintf(x), depending on which one your compiler likes better.

Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem or penryn, or later), with or without -ffast-math. You may need -fno-math-errno to get GCC to inline sometimes, but clang inlines anyway. This is 100% safe unless you actually expect lrintf or sqrtf or other math functions to set errno, and is generally recommended along with -fno-trapping-math.


Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. e.g. If that function is inlined somewhere that makes its argument a compile-time constant, it will still fld a constant and fistp it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function, and only needs to be run if the output value is needed, and that it doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler).

The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.

In this case, we can get the compiler to emit good code from pure C, so we should absolutely do that.

Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno can help, if you can't use the full -ffast-math or the MSVC equivalent)

With some compilers (gcc but not clang), lrintf is good. It isn't ideal, though: float->long->int isn't the same as directly to int when they're not the same size. The x86-64 SystemV ABI (used by everything except Windows) has 64bit long.

64bit long changes the overflow semantics for lrint: instead of getting 0x80000000 (on x86 with SSE instructions), you'll get the low 32bits of the long (which will be all-zero if the value was outside the range of a long).
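If the out-of-range behaviour ever matters (it doesn't for inputs in [-100, 100]), one way to make it independent of the width of long is to range-check in the float domain before converting. This is only a sketch: `round_to_int_saturating` is a hypothetical helper, it assumes a 32bit int, and the choice of 0 for NaN is arbitrary.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Hypothetical helper, not part of the code in question.
// Rounds with the current rounding mode and saturates to the range of int,
// so the result is the same whether long is 32 or 64 bits wide.
// Assumes int is 32 bits.
inline int round_to_int_saturating(float f) {
    if (std::isnan(f)) return 0;              // arbitrary choice for NaN
    if (f >= 2147483648.0f) return INT_MAX;   // 2^31 and above
    if (f < -2147483648.0f) return INT_MIN;   // below -2^31
    return static_cast<int>(std::lrint(f));   // the result now fits in int
}
```

The extra compares cost more than a bare cvtss2si, so this only makes sense where inputs aren't already known to be in range.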

This lrintf won't auto-vectorize (unless maybe the compiler can prove that the floats will be in-range), because there are only scalar, not SIMD, instructions to convert floats or double to packed 64bit integers (until AVX512DQ). IDK of a C math library function to convert directly to int, but you can use (int)nearbyintf(x), which does auto-vectorize more easily in 64bit code. See the section below for how well gcc and clang do with that.
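As a rough illustration of the kind of loop this is about, here is a sketch with a hypothetical `round_array` helper. Whether it actually vectorizes depends on the compiler and options (think -O3 -ffast-math, plus -msse4.1 for some compilers), so check the asm rather than taking it on faith.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: convert n floats to int using the current rounding
// mode (round-to-nearest-even unless something changed it).
void round_array(const float* in, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // nearbyint returns a float that is already integer-valued,
        // so the truncating cast does not change the value.
        out[i] = static_cast<int>(std::nearbyint(in[i]));
    }
}
```

Swapping `std::nearbyint` for `std::lrint` in the same loop is the variant that runs into the 64bit long problem described above.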

Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0 on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.

On AArch64 (aka ARM64), gcc4.8 compiles lround into a single fcvtas x0, s0 instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a. Maybe those aren't the right options; IDK ARM very well :/

If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32 (cvtps2dq), and even pack the resulting 32bit integer elements down to 16 or 8 bit (with packssdw and packsswb). However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
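A minimal sketch of the manual version, assuming an x86 target with SSE2. `round16_to_int8` is a made-up name, and the fixed block size of 16 is just what fills one 128bit vector of bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)

// Hypothetical helper: convert 16 floats to 16 int8_t using the current
// rounding mode, saturating to [-128, 127] in the pack steps.
void round16_to_int8(const float* in, std::int8_t* out) {
    // cvtps2dq: 4 floats -> 4 int32, current rounding mode
    __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(in + 0));
    __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(in + 4));
    __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(in + 8));
    __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(in + 12));

    __m128i ab = _mm_packs_epi32(a, b);    // packssdw: 8 x int16, signed saturation
    __m128i cd = _mm_packs_epi32(c, d);
    __m128i all = _mm_packs_epi16(ab, cd); // packsswb: 16 x int8, signed saturation

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), all);
}
```

Element order is preserved, so `out[i]` corresponds to `in[i]`. A real version would need a scalar loop for the leftover elements when the count isn't a multiple of 16.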


lrintf

#include <math.h>
int round_to_nearest(float f) {  // default mode is always nearest
  return lrintf(f);
}

Compiler output from the Godbolt Compiler explorer:

       ########### Without -ffast-math #############
    cvtss2si        eax, xmm0    # gcc 6.1  (-O3 -mx32, so long is 32bit)

    cvtss2si        rax, xmm0    # gcc 4.4 through 6.1  (-O3).  can't auto-vectorize, though.

    jmp     lrintf               # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/

             ###### With -ffast-math #########
    jmp     lrintf               # clang 3.8 (-O3 -msse4.1 -ffast-math)

So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math.


Don't use roundf/lroundf: it has non-standard rounding semantics (halfway cases away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use it for ARM? It does have fixed rounding behaviour, though, instead of using the current rounding mode.

If you want the return value as a float, instead of converting to int, it may be better to use nearbyintf. rint has to raise the FP inexact exception when output != input. (But SSE4.1 roundss can implement either behaviour with bit 3 of its immediate control byte).
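To make the halfway-case difference concrete, here is a small sketch (the struct and function names are made up for illustration). It assumes the rounding mode is still the default round-to-nearest-even.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pair of results for the same input.
struct HalfwayResult {
    long away_from_zero;  // lround: halfway cases always round away from 0
    long current_mode;    // lrint: uses the current rounding mode
};

inline HalfwayResult round_both_ways(float f) {
    return { std::lround(f), std::lrint(f) };
}
```

In the default mode, 0.5f gives {1, 0} and 2.5f gives {3, 2}, while 1.5f and 3.5f give the same answer either way because the even neighbour is also the one further from zero.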


truncating nearbyint() to int directly.

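#include <math.h>
int round_to_nearest(float f) {
  // nearbyintf rounds using the current rounding mode and returns a float;
  // the cast then truncates a value that is already an integer.
  return (int)nearbyintf(f);
}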

Compiler output from the Godbolt Compiler explorer.

        ########  With -ffast-math ############
    cvtss2si        eax, xmm0      # gcc 4.8 through 6.1 (-O3 -ffast-math)

    # clang is dumb and won't fold the roundss into the cvt.  Without sse4.1, it's a function call
    roundss xmm0, xmm0, 12         # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
    cvttss2si       eax, xmm0

    roundss   xmm1, xmm0, 12      # ICC13 (-O3 -msse4.1 -ffast-math)
    cvtss2si  eax, xmm1

        ######## WITHOUT -ffast-math ############
    sub     rsp, 8
    call    nearbyintf                    # gcc 6.1 (-O3 -msse4.1)
    add     rsp, 8                        # and clang without -msse4.1
    cvttss2si       eax, xmm0

    roundss xmm0, xmm0, 12               # clang3.2 and later (-O3 -msse4.1)
    cvttss2si       eax, xmm0

    roundss   xmm1, xmm0, 12             # ICC13 (-O3 -msse4.1)
    cvtss2si  eax, xmm1

Gcc 4.7 and earlier: Just cvttss2si without -msse4.1, but emits a roundss if SSE4.1 is available. Its nearbyint definition must be using inline-asm, because the asm syntax is broken in intel-syntax output. Probably this is how it gets inserted and then not optimized away when it realizes it's converting to int.


How it works in asm

Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards.

That's only true if you're targeting 20-year-old CPUs without SSE. (You said float, not double, so we only need SSE, not SSE2. The most recent mainstream CPUs without SSE2 are Athlon XP).

Modern systems do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si) or with the current rounding mode (cvtss2si). (Note the extra t for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double, and x86-64 allows the destination to be a 64bit integer register.
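The same split is visible from C++: a cast always truncates, while lrint follows whatever mode fesetround last selected. The sketch below uses hypothetical helper names. The volatile variables are only there to stop the compiler from constant-folding the conversion or moving it across the mode change, on the assumption that the compiler doesn't fully track floating-point environment accesses.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: convert f with lrint under the given rounding mode
// (FE_TONEAREST, FE_DOWNWARD, FE_UPWARD or FE_TOWARDZERO), then restore
// the mode that was active before the call.
long lrint_in_mode(float f, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile float input = f;                  // forces a real load at run time
    volatile long result = std::lrint(input);  // conversion happens before the restore
    std::fesetround(saved);
    return result;
}

// A plain cast truncates toward zero regardless of the current mode.
int cast_to_int(float f) {
    return static_cast<int>(f);
}
```

With 2.7f as input, `lrint_in_mode` gives 3 under FE_TONEAREST and 2 under FE_DOWNWARD, while `cast_to_int` gives 2 either way. Checking `std::fegetround() == FE_TONEAREST` once at startup is a cheap way to confirm the assumption the original inline asm silently relied on.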

See also the tag wiki.

cvttss2si basically exists because of C's default behaviour for casting float to int. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.

I think even 32bit versions of modern Windows require hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64bit calling conventions even pass float / double args in xmm registers).

Peter Cordes
  • While this answer makes the original consideration of this point obsolete: Any insight into the persistence/initial behaviour of `std::fesetround`? Can I assume `round-to-nearest` unless my program explicitly changes it? – Zsar Jun 03 '16 at 23:37
  • @Zsar: I think so. IIRC, the C/C++ standards require it to be the default setting at the beginning of `main()`. Any change to the rounding mode should take effect immediately (i.e. I think compilers aren't allowed to reorder FP calculations so they happen on the wrong side of a change in rounding mode. So it's like a compiler barrier). I'm not sure about this, though; I've never wanted to change rounding modes. (In asm, changing rounding modes is definitely "synchronous". The cardinal rule of out-of-order execution is that single-threaded code always appears to run in program order.) – Peter Cordes Jun 03 '16 at 23:46
  • Works for me. Great answer all along. Also really nice link to this compiler explorer thingy. Did not know something like this existed. Thank you very much. – Zsar Jun 03 '16 at 23:52
  • @Zsar: I just realized, the `roundss` / `cvtt` in some cases might be due to clang not assuming anything about the current rounding mode. (One of the main features of the SSE4.1 `round` instructions is independence from the current rounding mode, by specifying it in the imm8 byte). But I think `-ffast-math` is supposed to let it assume that. I'm glad you brought that up, because I wasn't really thinking about the difference between `lroundf()`'s fixed rounding behaviour, vs. `lrintf`'s use of the current rounding mode (which enables `cvtss2si`). – Peter Cordes Jun 04 '16 at 00:01
  • @Zsar: I think I'm done working on this answer now; having said everything there is to say. Now it's time to start thinking about which parts of this to include in a new canonical `[c]` or `[c++]` Q&A, if there isn't a better one than that one I found linked earlier. – Peter Cordes Jun 04 '16 at 02:16
  • One more point of interest for a canonical Q&A might be whether an unsigned target makes a difference. E.g. this part `SSE has instructions to convert a scalar float to signed int [...]` does seem to imply that this cannot happen. Older questions on this site, e.g. [this one](http://stackoverflow.com/a/29856283/3434465) would quickly cement this impression in the uninformed reader. Indeed, looking at a list of SSE instructions as e.g. [here](http://softpixel.com/~cwright/programming/simd/sse.php), one must notice that the "signed"/"unsigned" specifiers are absent from the conversion functions. – Zsar Jun 06 '16 at 14:26
  • @Zsar: yes, converting to `uint32_t` isn't easy until AVX512. Converting to narrower unsigned ints is easy, since there are pack instructions that saturate to unsigned, as that linked question shows. Converting to `int64_t` and converting that down to `uint32_t` is probably the only good approach. [gcc branches on the sign of the FP value for `float` or `double` to `uint64_t`, conditionally flipping the MSB of the signed result](https://godbolt.org/g/V146ol). – Peter Cordes Jun 06 '16 at 15:20
  • But compilers probably still make optimal code from casting `nearbyint()` – Peter Cordes Jun 06 '16 at 15:21
  • You forgot to mention that `lrintf` is only optimized to the single asm instruction when `-fno-math-errno` is passed, which is part of `-ffast-math`. This makes your comparison of with/without the latter confusing – Flamefire Aug 11 '19 at 09:12
  • @Flamefire: I added a mention in the TL:DR. Feel free to edit if there's a specific other place you think it should be mentioned; I didn't re-read this entire answer. According to my last Godbolt link, clang does still inline lrintf without it. – Peter Cordes Aug 11 '19 at 09:19
  • Interesting. I couldn't get clang to inline it but then noticed your -march=nehalem which seems to do that – Flamefire Aug 11 '19 at 09:34
  • @Flamefire: `roundss` is only available with SSE4.1, not baseline x86-64 (SSE2). I'm using `-march=nehalem` to enable that (and SSE4.2 and set tuning options). – Peter Cordes Aug 11 '19 at 09:40