The actual fastest implementation for a large array on modern x86 CPUs would be:

- Change the MXCSR FP rounding mode to round towards -Infinity (aka `floor`). In C, this should be possible with `fenv` stuff, or `_mm_getcsr` / `_mm_setcsr`.
- Loop over the array doing `_mm_cvtps_epi32` on SIMD vectors, converting 4 `float`s to 32-bit integers using the current rounding mode, and storing the result vectors to the destination. `cvtps2dq xmm0, [rdi]` is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2 (https://agner.org/optimize/). Same for the 256-bit AVX version, with YMM vectors.
- Restore the rounding mode to the normal IEEE default (round-to-nearest, with even as a tiebreak), using the original value of the MXCSR. A minimal C sketch of all three steps follows this list.
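Here's a minimal sketch of those three steps, assuming `n` is a multiple of 4 (the function name `floor_convert` is mine, not from the question). In practice you may also need to keep the compiler from reordering FP operations across the MXCSR writes, e.g. with `-frounding-math` or by keeping the loop in a separate non-inlined function:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Convert n floats to int32 with floor() semantics; n % 4 == 0 assumed.
void floor_convert(int32_t *dst, const float *src, size_t n)
{
    unsigned int saved_csr = _mm_getcsr();   // save the whole MXCSR
    _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);   // rounding = toward -Infinity

    for (size_t i = 0; i < n; i += 4) {
        __m128  v  = _mm_loadu_ps(src + i);  // load 4 floats
        __m128i iv = _mm_cvtps_epi32(v);     // cvtps2dq: uses current rounding mode
        _mm_storeu_si128((__m128i *)(dst + i), iv);
    }

    _mm_setcsr(saved_csr);                   // restore the original MXCSR
}
```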
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers. In the bad old days with x87, even `(int)x` required changing the x87 rounding mode to truncation and then back. That's `cvttps2dq` for packed float->int with truncation (note the extra `t` in the mnemonic). Or for scalar, going from XMM to integer registers, `cvttss2si` for scalar `float` or `cvttsd2si` for scalar `double` to scalar integer.)
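For comparison, a truncating loop (again, function name mine) needs no MXCSR bookkeeping at all, since `cvttps2dq` always truncates regardless of the current rounding mode:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Truncating version: cvttps2dq ignores the MXCSR rounding mode,
// so no save/restore is needed. Same semantics as a C (int)x cast.
void trunc_convert(int32_t *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_cvttps_epi32(v));
    }
}
```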
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, leaving just 1-per-clock store throughput as the limit, assuming no cache-miss bottlenecks. (And on Intel before Skylake, also the 1-per-clock packed-conversion throughput.) I.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512. A sketch of an unrolled AVX version is below.
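Here's what that unrolling might look like with AVX (the unroll factor and function name are my choices, not from the original; this assumes the rounding mode was already set as described above and `n % 16 == 0`):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// AVX version, unrolled by 2 (16 floats per iteration) so loop
// overhead doesn't limit throughput below 1 store per clock.
void floor_convert_avx(int32_t *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(src + i));     // vcvtps2dq
        __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(src + i + 8));
        _mm256_storeu_si256((__m256i *)(dst + i), a);
        _mm256_storeu_si256((__m256i *)(dst + i + 8), b);
    }
}
```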
Without changing the current rounding mode, you need SSE4.1 `roundps` to round a `float` to the nearest integer `float` using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
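A sketch of that SSE4.1 approach (compile with `-msse4.1` or a suitable `-march`; the function name is mine): the `_mm_floor_ps` intrinsic compiles to `roundps` with the round-toward-`-Inf` immediate.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// SSE4.1: floor each vector, then convert. No MXCSR changes needed,
// but roundps costs an extra uop per vector vs. the MXCSR trick.
void floor_convert_sse41(int32_t *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 f = _mm_floor_ps(_mm_loadu_ps(src + i)); // roundps, toward -Inf
        // after floor the value is already integer-valued,
        // so truncation gives the same result as floor here
        _mm_storeu_si128((__m128i *)(dst + i), _mm_cvttps_epi32(f));
    }
}
```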
(With the right compiler options, like `-fno-math-errno`, and the right `-march` or `-msse4` options, compilers can inline `floor` using `roundps`, or the scalar and/or double-precision equivalent, e.g. `roundsd xmm1, xmm0, 1`, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline `roundsd` for `floor` even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with `-march=haswell`; SSE4.1 is unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
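For example, a toy scalar function like this (my example, not from the question) should compile with gcc `-O3 -march=haswell` to a `roundsd` followed by `cvttsd2si`, with no call to libm's `floor`:

```c
#include <math.h>

// With gcc -O3 -march=haswell this should inline floor() as roundsd,
// then convert with cvttsd2si instead of calling the library function.
long floor_to_long(double x)
{
    return (long)floor(x);
}
```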