TL;DR: use lrintf(x)
or (int)nearbyintf(x)
, depending on which one your compiler likes better.
Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem
or penryn, or later), with or without -ffast-math
. You may need -fno-math-errno
to get GCC to inline sometimes, but clang inline anyway. This is 100% safe unless you actually expect lrintf
or sqrtf
or other math functions to set errno
, and is generally recommended along with -fno-trapping-math
.
Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. e.g. If that function is inlined somewhere that makes its argument a compile-time constant, it will still fld
a constant and fistp
it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32
, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function, and only needs to be run if the output value is needed, and that it doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler).
The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.
In this case, we can get the compiler to emit good code from pure C, so we should absolutely do that.
Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno
can help, if you can't use the full -ffast-math
or the MSVC equivalent)
With some compilers (gcc but not clang), lrintf
is good. It isn't ideal, though: float
->long
->int
isn't the same as directly to int
when they're not the same size. The x86-64 SystemV ABI (used by everything except Windows) has 64bit long
.
64bit long
changes the overflow semantics for lrint
: instead of getting 0x80000000
(on x86 with SSE instructions), you'll get the low 32bits of the long
(which will be all-zero if the value was outside the range of a long
).
This lrintf
won't auto-vectorize (unless maybe the compiler can prove that the floats will be in-range), because there are only scalar, not SIMD, instructions to convert float
s or double
to packed 64bit integers (until AVX512DQ). IDK of a C math library function to convert directly to int
, but you can use (int)nearbyintf(x)
, which does auto-vectorize more easily in 64bit code. See the section below for how well gcc and clang do with that.
Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0
on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.
On AArch64 (aka ARM64), gcc4.8 compiles lround
into a single fcvtas x0, s0
instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math
makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a
. Maybe those aren't the right options; IDK ARM very well :/
If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32
(cvtps2dq
), and even pack the resulting 32bit integer elements down to 16 or 8 bit (with packssdw
. However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
lrintf
#include <math.h>
int round_to_nearest(float f) { // default mode is always nearest
return lrintf(f);
}
Compiler output from the Godbolt Compiler explorer:
########### Without -ffast-math #############
cvtss2si eax, xmm0 # gcc 6.1 (-O3 -mx32, so long is 32bit)
cvtss2si rax, xmm0 # gcc 4.4 through 6.1 (-O3). can't auto-vectorize, though.
jmp lrintf # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/
###### With -ffast-math #########
jmp lrintf # clang 3.8 (-O3 -msse4.1 -ffast-math)
So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math
.
Don't use roundf
/lroundf
: it has non-standard rounding semantics (halfway cases away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use it for ARM? It does have fixed rounding behaviour, though, instead of using the current rounding mode.
If you want the return value as a float
, instead of converting to int, it may be better to use nearbyintf
. rint
has to raise the FP inexact exception when output != input. (But SSE4.1 roundss
can implement either behaviour with bit 3 of its immediate control byte).
truncating nearbyint()
to int
directly.
#include <math.h>
int round_to_nearest(float f) {
return nearbyintf(f);
}
Compiler output from the Godbolt Compiler explorer.
######## With -ffast-math ############
cvtss2si eax, xmm0 # gcc 4.8 through 6.1 (-O3 -ffast-math)
# clang is dumb and won't fold the roundss into the cvt. Without sse4.1, it's a function call
roundss xmm0, xmm0, 12 # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1 -ffast-math)
cvtss2si eax, xmm1
######## WITHOUT -ffast-math ############
sub rsp, 8
call nearbyintf # gcc 6.1 (-O3 -msse4.1)
add rsp, 8 # and clang without -msse4.1
cvttss2si eax, xmm0
roundss xmm0, xmm0, 12 # clang3.2 and later (-O3 -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1)
cvtss2si eax, xmm1
Gcc 4.7 and earlier: Just cvttss2si
without -msse4.1
, but emits a roundss
if SSE4.1 is available. It's nearbyint definition must be using inline-asm, because the asm syntax is broken in intel-syntax output. Probably this is how it gets inserted and then not optimized away when it realizes it's converting to int.
How it works in asm
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards.
That's only true if you're targeting 20-year-old CPUs without SSE. (You said float
, not double
, so we only need SSE, not SSE2. The oldest CPUs without SSE2 are Athlon XP).
Modern system do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si
) or with the current counting mode (cvtss2si
). (Note the extra t
for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double
, and x86-64 allows the destination to be a 64bit integer register.
See also the x86 tag wiki.
cvtss2si
basically exists because of C's default behaviour for casting float to int. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.
I think even 32bit versions of modern Windows requires hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64bit calling conventions even pass float
/ double
args in xmm registers).