FYI, you normally don't need explicit round-to-nearest. The default mode is round-to-nearest, with all exceptions masked. Plain _mm512_cvtepi64_pd and _mm512_cvtpd_epi64 will behave identically to what you're doing unless you've changed the default rounding mode or exception mask in this thread with fenv or _MM_SET_ROUNDING_MODE.
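For example, under default MXCSR settings these two should compile to the same vcvtpd2qq instruction (a minimal sketch; the function names are just for illustration):

#include <immintrin.h>

__m512i to_int64_default(__m512d v) {
    return _mm512_cvtpd_epi64(v);   // uses MXCSR.RC, round-to-nearest by default
}

__m512i to_int64_explicit(__m512d v) {
    // identical result unless this thread has changed MXCSR
    return _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}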
Suppressing exceptions only means they don't fault, but it doesn't stop a subnormal or overflow from setting the relevant sticky status bit in MXCSR, if I'm reading Intel's manual correctly. They say it's just like having the masking bits set in MXCSR, not that it prevents an exception from being recorded at all in MXCSR status bits.
A more common use-case for _mm512_cvt_roundpd_epi64 would be to convert to integer with floor or ceil rounding (towards -/+Infinity), instead of needing a separate rounding step before converting like you do with 128-bit or 256-bit vectors.
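For example (a sketch assuming AVX512DQ, plus AVX512VL for the 256-bit version; function names are illustrative):

#include <immintrin.h>

// 256-bit: separate floor, then convert (vroundpd + vcvtpd2qq)
__m256i floor_convert_256(__m256d v) {
    return _mm256_cvtpd_epi64(_mm256_floor_pd(v));
}

// 512-bit: one vcvtpd2qq with an embedded round-toward-negative-Inf override
__m512i floor_convert_512(__m512d v) {
    return _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}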
But if you are running with some FP exceptions unmasked or a possibly-non-default rounding mode, then explicit round-to-nearest does make sense.
Rounding-mode overrides must always include _MM_FROUND_NO_EXC
It would be good if compilers provided better error messages which told you this. (TODO: file feature-request bug reports on gcc and clang).
(_MM_FROUND_CUR_DIRECTION doesn't count; it means "no override", the same as if you'd used the normal non-round version of the intrinsic.)
Intel's intrinsics guide points this out (in the entry for _mm512_cvt_roundepi64_pd specifically, but you'll find the same in every intrinsic that takes a rounding-mode override arg.)
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE
Note that _MM_FROUND_NO_EXC on its own happens to be valid because _MM_FROUND_TO_NEAREST_INT happens to be 0, same as in the machine-code encoding for the 2-bit rounding-mode field when EVEX.b is set. But you should really make that explicit in your _mm512_cvt_roundpd_epi64.
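Concretely, something like this (sketch):

#include <immintrin.h>

__m512i cvt_nearest(__m512d v) {
    // _mm512_cvt_roundpd_epi64(v, _MM_FROUND_NO_EXC) compiles, but only
    // because _MM_FROUND_TO_NEAREST_INT == 0. Spell out the mode instead:
    return _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}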
For instructions which don't have rounding-control, like _mm512_cvtt_roundpd_epi64 (note the extra t for truncate), only _MM_FROUND_NO_EXC (or _MM_FROUND_CUR_DIRECTION) is allowed, because the behaviour isn't affected by the value of the 2-bit field, just by whether or not a rounding override was specified.
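e.g. (sketch):

#include <immintrin.h>

__m512i cvt_trunc(__m512d v) {
    // truncation is inherent in vcvttpd2qq; SAE is the only meaningful "override"
    return _mm512_cvtt_roundpd_epi64(v, _MM_FROUND_NO_EXC);
}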
In the machine encoding for the EVEX prefix, setting a rounding-mode override implies SAE (suppress all exceptions). There's no way to encode an _MM_FROUND_TO_NEAREST_INT override without also suppressing exceptions.
From Intel's vol.2 instruction set reference manual:
2.6.8 Static Rounding Support in EVEX
Static rounding control embedded in the EVEX encoding system applies only to register-to-register flavor of floating-point instructions with rounding semantic at two distinct vector lengths: (i) scalar, (ii) 512-bit. In both cases, the field EVEX.L’L expresses rounding mode control overriding MXCSR.RC if EVEX.b is set. When EVEX.b is set, “suppress all exceptions” is implied.
Notice that rounding overrides make it impossible for the compiler to use a memory source operand, because the EVEX.b bit means broadcast vs. non-broadcast in that context. Not a problem in your case; the data is coming from a _mm512_sub_epi64. But it's worth pointing out in general that overriding the rounding mode to what's already the default can have a minor performance penalty, by requiring an extra load instruction in some cases where it wouldn't otherwise be needed. Static rounding is always better than an extra _mm512_roundscale_pd, though (the intrinsic _mm512_round_ps is missing for AVX512).
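A sketch of what that looks like at the source level (whether the load actually gets folded depends on the compiler and surrounding code):

#include <immintrin.h>

// Plain intrinsic: compilers can fold the load, e.g. vcvttpd2qq zmm0, [rdi]
__m512i from_mem_default(const double *p) {
    return _mm512_cvttpd_epi64(_mm512_loadu_pd(p));
}

// With an override, EVEX.b is needed for the rounding/SAE bits,
// so a separate vmovupd load has to come first.
__m512i from_mem_override(const double *p) {
    return _mm512_cvtt_roundpd_epi64(_mm512_loadu_pd(p), _MM_FROUND_NO_EXC);
}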
BTW, these restrictions (only for scalar or 512-bit vectors, and only non-memory instructions) are why it makes sense for AVX512 to have vcvttpd2qq at all, instead of just using _MM_FROUND_TO_ZERO|_MM_FROUND_NO_EXC with _mm512_cvt_roundpd_epi64. Because there is no _mm256_cvt_roundpd_epi64, and it's occasionally nice if the compiler can fold a load into a memory operand for vcvttpd2qq.
There's also historical precedent: ever since SSE1 cvttss2si (and SSE2 cvttps2dq), Intel has had truncating conversions which make it much more efficient to implement C's FP->int cast semantics without changing the MXCSR rounding mode, the way we used to have to with x87 (before SSE3 fisttp).
Before AVX512, there was never support for packed conversions involving 64-bit integers, so there was no existing 128-bit or 256-bit version of that instruction. It was a good design decision to provide one, though.
Rounding overrides are new in AVX512. Before that, packed rounding-to-integer (with input and output both being __m128 or __m128d) with an explicit mode was possible with SSE4.1 roundps / roundpd.
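e.g. the pre-AVX512 two-step version of a floor-convert (sketch, assuming SSE4.1):

#include <immintrin.h>

__m128i floor_convert_sse41(__m128d v) {
    __m128d f = _mm_round_pd(v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC); // SSE4.1 roundpd
    return _mm_cvtpd_epi32(f);  // SSE2; note this one produces int32, not int64
}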
Alternate implementations for more efficiency:
Add instead of sub:
__m512i minus_start = _mm512_set1_epi64(-starting_epoch_milliseconds_);
for (size_t i = 0; i < n; i += 8) {   // 8 int64 elements per 512-bit vector
    __m512i data = _mm512_add_epi64(_mm512_loadu_si512(&input[i]), minus_start); // input: hypothetical source buffer
    // ... rest of the loop body
}
add is commutative, so the compiler can fold the load into a memory-source add like vpaddq zmm0, zmm8, [rdi], instead of needing a separate load before a sub. clang does this optimization for you, but gcc doesn't.
It looks like you want to round your input integers to the nearest multiple of 3600.
Replace divide with multiply
1.0/3600 rounded to the nearest double is about 2.777777777777777775368439616699e-04, which is only wrong by at most 0.5 parts in 2^53 (the significand precision of double). That's a relative error of about 10^-16, so for inputs up to around 10^16, lrint(x * (1.0/3600)) is within 1 of lrint(x / 3600.0). For most reasonable-sized inputs, they're exactly equal.
You will still always get an exact multiple of 3600 after multiplying, but the tiny error in the "division" could leave you off by one multiple, i.e. by 3600, in the end.
You could write a test program to find cases where you get different results from division vs. multiplication by an inverse.
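A minimal brute-force checker might look like this (a sketch; exhaustive over a small range, you'd want random sampling for larger ones):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    const double inv = 1.0 / 3600;  // the inverse, rounded once to double
    for (int64_t x = 0; x < 100000000; x++) {
        int64_t q_div = llrint((double)x / 3600.0);
        int64_t q_mul = llrint((double)x * inv);
        if (q_div != q_mul)
            printf("mismatch at %lld: %lld vs %lld\n",
                   (long long)x, (long long)q_div, (long long)q_mul);
    }
    return 0;
}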
Can you do this as part of another pass over the data? It's not much computation for all that memory bandwidth. Especially if you can't replace the div_pd with a multiply by the inverse: it totally bottlenecks on FP division without keeping other execution units busy.
There are three strategies here:
pure integer, using a multiplicative inverse for exact division. See Why does GCC use multiplication by a strange number in implementing integer division?. Even AVX512DQ doesn't have an integer multiply that gives you the high half of a 64x64 => 128-bit product, only vpmullq 64x64 => 64-bit (and it's multiple uops). Without AVX512IFMA VPMADD52HUQ (the high half of a 52x52 => 104-bit multiply), see Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?. (Or if you actually only care about the low 32 bits of your input, then a 32x32 => 64-bit multiply and a 64-bit shift should work, using _mm512_mul_epu32, single-uop vpmuludq.) But this would also take extra work to round to nearest instead of truncating.
what you're doing now: double divide (or multiply by inverse), convert to nearest int64_t, 64-bit integer multiply (sketched below). Input might get rounded to the nearest double if > 2^53, but the final result will always be an exact multiple of 3600 (unless the multiply overflows int64_t).
double divide (or multiply), round to the nearest integer (without converting), double multiply, convert to integer (also sketched below). The result of the last multiply could be a problem if it's above 2^(53+4): 3600 is a multiple of 2^4 but not of 2^5, so rounding to the nearest representable double might give a number that isn't an exact multiple of 3600, for very large inputs. If range limits aren't a problem, you could even fold the subtract in with an fma(val, 3600, -3600.0*start).
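Sketches of strategies 2 and 3, assuming AVX512DQ and that the multiply-by-inverse is acceptable (names are illustrative; the start-time subtraction is assumed to have happened already):

#include <immintrin.h>

// Strategy 2: double multiply by inverse, convert to nearest int64, integer multiply.
__m512i nearest_3600_int(__m512i v) {
    __m512d d  = _mm512_cvtepi64_pd(v);                         // int64 -> double
    __m512d q  = _mm512_mul_pd(d, _mm512_set1_pd(1.0 / 3600));  // the "division"
    __m512i qi = _mm512_cvtpd_epi64(q);                         // round to nearest int64
    return _mm512_mullo_epi64(qi, _mm512_set1_epi64(3600));     // vpmullq: back to a multiple
}

// Strategy 3: stay in FP for the scale-back-up multiply, convert once at the end.
__m512i nearest_3600_fp(__m512i v) {
    __m512d d = _mm512_cvtepi64_pd(v);
    __m512d q = _mm512_mul_pd(d, _mm512_set1_pd(1.0 / 3600));
    __m512d r = _mm512_roundscale_pd(q, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm512_cvtpd_epi64(_mm512_mul_pd(r, _mm512_set1_pd(3600.0)));
}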
SIMD FP multiply has significantly better throughput than integer multiply, so it might be a win overall, even with the extra cost of an FP round-to-nearest instruction.
You can sometimes avoid explicit rounding instructions by adding then subtracting a large constant, like @Mysticial does in Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?. You make the value big enough that the nearest representable doubles are whole integers. (How to efficiently perform double/int64 conversions with SSE/AVX?, for limited-range inputs, also shows some FP manipulation tricks.)
Maybe we can do rounded = fma(v, 1.0/3600, round_constant), then subtract round_constant to get a value rounded to the nearest integer without _mm512_roundscale_pd. We could even fold the scaling back up into it with fma(rounded, 3600, -3600*round_constant): 2^52 * 3600 = 4503599627370496.0 * 3600 is exactly representable as a double.
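A sketch of that FMA version (assumes the quotient is non-negative and well below 2^52; signed inputs would need the usual 2^52 + 2^51 constant instead):

#include <immintrin.h>

__m512i nearest_3600_fma(__m512i v) {
    const __m512d round_constant = _mm512_set1_pd(0x1.0p52);  // 2^52
    __m512d d       = _mm512_cvtepi64_pd(v);
    // adding 2^52 pushes the value into the range where doubles are whole numbers,
    // so the FMA's final rounding does the round-to-nearest-integer for us
    __m512d rounded = _mm512_fmadd_pd(d, _mm512_set1_pd(1.0 / 3600), round_constant);
    // fold "subtract round_constant, then scale by 3600" into one FMA;
    // -3600.0 * 2^52 is exactly representable
    __m512d scaled  = _mm512_fmadd_pd(rounded, _mm512_set1_pd(3600.0),
                                      _mm512_set1_pd(-3600.0 * 0x1.0p52));
    return _mm512_cvtpd_epi64(scaled);
}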
There might be a double-rounding problem: first when converting from int64_t to the nearest double (if it's so big that the integer isn't exactly representable), then again when dividing and rounding to the nearest integer.
Costs: I'm assuming you can replace the FP division with a multiply by 1.0/3600.
fp mul, convert, integer mul: vcvtqq2pd (1 uop for the FMA ports) + vmulpd (1 uop) + vcvtpd2qq (1 uop) + vpmullq (3 uops for the FMA ports) = 6 uops for the 2 FMA ports. vpsubq zmm also competes for the same ports, so really it's 7. (SKX uop counts from Agner Fog's testing.)
fp everything: vcvtqq2pd (1 uop for the FMA ports) + vmulpd (1 uop) + vrndscalepd (2 uops) + vmulpd (1 uop) + vcvtpd2qq (1 uop) = 6 uops again, but maybe lower latency (vrndscale + vmulpd is 8 + 4 cycles of latency, vs. vpmullq's 15). But out-of-order exec should easily hide this latency when looping over an array of independent vectors, so saving latency isn't a huge deal.
I'm not sure how efficient you can make an "integer" multiply, or whether FP bithacks could avoid the convert instructions entirely. If this is performance-critical, it might be worth investigating.