FYI, you normally don't need explicit round-to-nearest. The default mode is round-to-nearest, with all exceptions masked. Plain _mm512_cvtepi64_pd and _mm512_cvtpd_epi64 will behave identically to what you're doing unless you've changed the default rounding mode or exception mask in this thread with fenv or _MM_SET_ROUNDING_MODE.
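For example, under default MXCSR settings these two should compile to the same vcvtpd2qq instruction (a minimal sketch; the function names are just for illustration):

#include <immintrin.h>

__m512i to_int64_default(__m512d v) {
    return _mm512_cvtpd_epi64(v);   // uses MXCSR.RC, round-to-nearest by default
}

__m512i to_int64_explicit(__m512d v) {
    // identical result unless this thread has changed MXCSR
    return _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}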
Suppressing exceptions only means they don't fault, but it doesn't stop a subnormal or overflow from setting the relevant sticky status bit in MXCSR, if I'm reading Intel's manual correctly. They say it's just like having the masking bits set in MXCSR, not that it prevents an exception from being recorded at all in MXCSR status bits.
A more common use-case for _mm512_cvt_roundpd_epi64 would be to convert to integer with floor or ceil rounding (towards -/+Infinity), instead of needing a separate rounding step before converting like you do with 128-bit or 256-bit vectors.
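For example (a sketch assuming AVX512DQ, plus AVX512VL for the 256-bit version; function names are illustrative):

#include <immintrin.h>

// 256-bit: separate floor, then convert (vroundpd + vcvtpd2qq)
__m256i floor_convert_256(__m256d v) {
    return _mm256_cvtpd_epi64(_mm256_floor_pd(v));
}

// 512-bit: one vcvtpd2qq with an embedded round-toward-negative-Inf override
__m512i floor_convert_512(__m512d v) {
    return _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}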
But if you are running with some FP exceptions unmasked or a possibly-non-default rounding mode, then explicit round-to-nearest does make sense.
Rounding-mode overrides must always include _MM_FROUND_NO_EXC
It would be good if compilers provided better error messages which told you this. (TODO: file feature-request bug reports on gcc and clang).
(_MM_FROUND_CUR_DIRECTION doesn't count; it means "no override", the same as if you'd used the normal non-round version of the intrinsic.)
Intel's intrinsics guide points this out (in the entry for _mm512_cvt_roundepi64_pd specifically, but you'll find the same in every intrinsic that takes a rounding-mode override arg.)
Rounding is done according to the rounding parameter, which can be one of:
(_MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) // round to nearest, and suppress exceptions
(_MM_FROUND_TO_NEG_INF |_MM_FROUND_NO_EXC) // round down, and suppress exceptions
(_MM_FROUND_TO_POS_INF |_MM_FROUND_NO_EXC) // round up, and suppress exceptions
(_MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC) // truncate, and suppress exceptions
_MM_FROUND_CUR_DIRECTION // use MXCSR.RC; see _MM_SET_ROUNDING_MODE
Note that _MM_FROUND_NO_EXC on its own happens to be valid because _MM_FROUND_TO_NEAREST_INT happens to be 0, same as in the machine-code encoding for the 2-bit rounding-mode field when EVEX.b is set. But you should really make that explicit in your _mm512_cvt_roundpd_epi64.
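Concretely, something like this (sketch):

#include <immintrin.h>

__m512i cvt_nearest(__m512d v) {
    // _mm512_cvt_roundpd_epi64(v, _MM_FROUND_NO_EXC) compiles, but only
    // because _MM_FROUND_TO_NEAREST_INT == 0. Spell out the mode instead:
    return _mm512_cvt_roundpd_epi64(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}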
For instructions which don't have rounding-control, like _mm512_cvtt_roundpd_epi64 (note the extra t for truncate), only _MM_FROUND_NO_EXC (or _MM_FROUND_CUR_DIRECTION) is allowed, because the behaviour isn't affected by the value of the 2-bit field, just by whether or not a rounding override was specified.
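e.g. (sketch):

#include <immintrin.h>

__m512i cvt_trunc(__m512d v) {
    // truncation is inherent in vcvttpd2qq; SAE is the only meaningful "override"
    return _mm512_cvtt_roundpd_epi64(v, _MM_FROUND_NO_EXC);
}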
In the machine encoding for the EVEX prefix, setting a rounding-mode override implies SAE (suppress all exceptions). There's no way to encode an _MM_FROUND_TO_NEAREST_INT override without also suppressing exceptions.
From Intel's vol.2 instruction set reference manual:
2.6.8 Static Rounding Support in EVEX
Static rounding control embedded in the EVEX encoding system applies only to register-to-register flavor of floating-point instructions with rounding semantic at two distinct vector lengths: (i) scalar, (ii) 512-bit. In both cases, the field EVEX.L’L expresses rounding mode control overriding MXCSR.RC if EVEX.b is set. When EVEX.b is set, “suppress all exceptions” is implied.
Notice that rounding overrides make it impossible for the compiler to use a memory source operand, because the EVEX.b bit means broadcast vs. non-broadcast in that context. Not a problem in your case; the data is coming from a _mm512_sub_epi64. But it's worth pointing out in general that overriding the rounding mode to what's already the default can have a minor performance penalty, by requiring an extra load instruction in some cases where it wouldn't otherwise be needed. Static rounding is always better than an extra _mm512_roundscale_pd, though (the intrinsic _mm512_round_ps is missing for AVX512).
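A sketch of what that looks like at the source level (whether the load actually gets folded depends on the compiler and surrounding code):

#include <immintrin.h>

// Plain intrinsic: compilers can fold the load, e.g. vcvttpd2qq zmm0, [rdi]
__m512i from_mem_default(const double *p) {
    return _mm512_cvttpd_epi64(_mm512_loadu_pd(p));
}

// With an override, EVEX.b is needed for the rounding/SAE bits,
// so a separate vmovupd load has to come first.
__m512i from_mem_override(const double *p) {
    return _mm512_cvtt_roundpd_epi64(_mm512_loadu_pd(p), _MM_FROUND_NO_EXC);
}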
BTW, these restrictions (only for scalar or 512-bit vectors, and only non-memory instructions) are why it makes sense for AVX512 to have vcvttpd2qq at all, instead of just using _MM_FROUND_TO_ZERO|_MM_FROUND_NO_EXC with _mm512_cvt_roundpd_epi64. Because there is no _mm256_cvt_roundpd_epi64, and it's occasionally nice if the compiler can fold a load into a memory operand for vcvttpd2qq.
There's also historical precedent: ever since SSE1 cvttss2si (and SSE2 cvttps2dq), Intel has had truncating conversions which make it much more efficient to implement C's FP->int cast semantics without changing the MXCSR rounding mode, the way we used to have to with x87 (before SSE3 fisttp).
Before AVX512, there was never support for packed conversions involving 64-bit integers, so there was no existing 128-bit or 256-bit version of that instruction. It was a good design decision to provide one, though.
Rounding overrides are new in AVX512. Before that, packed rounding-to-integer (with input and output both being __m128 or __m128d) with an explicit mode was possible with SSE4.1 roundps / roundpd.
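e.g. the pre-AVX512 two-step version of a floor-convert (sketch, assuming SSE4.1):

#include <immintrin.h>

__m128i floor_convert_sse41(__m128d v) {
    __m128d f = _mm_round_pd(v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC); // SSE4.1 roundpd
    return _mm_cvtpd_epi32(f);  // SSE2; note this one produces int32, not int64
}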
Alternate implementations for more efficiency:
Add instead of sub:
__m512i minus_start = _mm512_set1_epi64(-starting_epoch_milliseconds_);
for (size_t i = 0; i < n; i += 8) {   // 8 int64 elements per 512-bit vector
    __m512i data = _mm512_add_epi64(_mm512_loadu_si512(&input[i]), minus_start); // input: hypothetical source buffer
    // ... rest of the loop body
}
add is commutative, so the compiler can fold the load into a memory-source add like vpaddq zmm0, zmm8, [rdi], instead of needing a separate load before a sub. clang does this optimization for you, but gcc doesn't.
It looks like you want to round your input integers to the nearest multiple of 3600.
Replace divide with multiply
1.0/3600 rounded to the nearest double is about 2.777777777777777775368439616699e-04, which is only wrong by at most 0.5 parts in 2^53 (the significand precision of double). That's a relative error of about 10^-16, so for inputs up to around 10^16, lrint(x * (1.0/3600)) is within 1 of lrint(x / 3600.0). For most reasonable-sized inputs, they're exactly equal.
You will still always get an exact multiple of 3600 after multiplying, but the tiny error in the "division" could leave you off by one multiple, i.e. by 3600, in the end.
You could write a test program to find cases where you get different results from division vs. multiplication by an inverse.
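A minimal brute-force checker might look like this (a sketch; exhaustive over a small range, you'd want random sampling for larger ones):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    const double inv = 1.0 / 3600;  // the inverse, rounded once to double
    for (int64_t x = 0; x < 100000000; x++) {
        int64_t q_div = llrint((double)x / 3600.0);
        int64_t q_mul = llrint((double)x * inv);
        if (q_div != q_mul)
            printf("mismatch at %lld: %lld vs %lld\n",
                   (long long)x, (long long)q_div, (long long)q_mul);
    }
    return 0;
}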
Can you do this as part of another pass over the data? It's not much computation for all that memory bandwidth. Especially if you can't replace the div_pd with a multiply by the inverse: it totally bottlenecks on FP division without keeping other execution units busy.
There are three strategies here:
pure integer, using a multiplicative inverse for exact division. See Why does GCC use multiplication by a strange number in implementing integer division?. Even AVX512DQ doesn't have an integer multiply that gives you the high half of a 64x64 => 128-bit product, only vpmullq 64x64 => 64-bit (and it's multiple uops). Without AVX512IFMA VPMADD52HUQ (the high half of a 52x52 => 104-bit multiply), see Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?. (Or if you actually only care about the low 32 bits of your input, then a 32x32 => 64-bit multiply and a 64-bit shift should work, using _mm512_mul_epu32, single-uop vpmuludq.) But this would also take extra work to round to nearest instead of truncating.
what you're doing now: double divide (or multiply by inverse), convert to nearest int64_t, 64-bit integer multiply (sketched below). Input might get rounded to the nearest double if > 2^53, but the final result will always be an exact multiple of 3600 (unless the multiply overflows int64_t).
double divide (or multiply), round to the nearest integer (without converting), double multiply, convert to integer (also sketched below). The result of the last multiply could be a problem if it's above 2^(53+4): 3600 is a multiple of 2^4 but not of 2^5, so rounding to the nearest representable double might give a number that isn't an exact multiple of 3600, for very large inputs. If range limits aren't a problem, you could even fold the subtract in with an fma(val, 3600, -3600.0*start).
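Sketches of strategies 2 and 3, assuming AVX512DQ and that the multiply-by-inverse is acceptable (names are illustrative; the start-time subtraction is assumed to have happened already):

#include <immintrin.h>

// Strategy 2: double multiply by inverse, convert to nearest int64, integer multiply.
__m512i nearest_3600_int(__m512i v) {
    __m512d d  = _mm512_cvtepi64_pd(v);                         // int64 -> double
    __m512d q  = _mm512_mul_pd(d, _mm512_set1_pd(1.0 / 3600));  // the "division"
    __m512i qi = _mm512_cvtpd_epi64(q);                         // round to nearest int64
    return _mm512_mullo_epi64(qi, _mm512_set1_epi64(3600));     // vpmullq: back to a multiple
}

// Strategy 3: stay in FP for the scale-back-up multiply, convert once at the end.
__m512i nearest_3600_fp(__m512i v) {
    __m512d d = _mm512_cvtepi64_pd(v);
    __m512d q = _mm512_mul_pd(d, _mm512_set1_pd(1.0 / 3600));
    __m512d r = _mm512_roundscale_pd(q, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm512_cvtpd_epi64(_mm512_mul_pd(r, _mm512_set1_pd(3600.0)));
}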
SIMD FP multiply has significantly better throughput than integer multiply, so it might be a win overall, even with the extra cost of an FP round-to-nearest instruction.
You can sometimes avoid explicit rounding instructions by adding then subtracting a large constant, like @Mysticial does in Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?. You make the value big enough that the nearest representable doubles are whole integers. (How to efficiently perform double/int64 conversions with SSE/AVX?, for limited-range inputs, also shows some FP manipulation tricks.)
Maybe we can do rounded = fma(v, 1.0/3600, round_constant), then subtract round_constant to get a value rounded to the nearest integer without _mm512_roundscale_pd. We could even fold the scaling back up into it with fma(rounded, 3600, -3600*round_constant): 2^52 * 3600 = 4503599627370496.0 * 3600 is exactly representable as a double.
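A sketch of that FMA version (assumes the quotient is non-negative and well below 2^52; signed inputs would need the usual 2^52 + 2^51 constant instead):

#include <immintrin.h>

__m512i nearest_3600_fma(__m512i v) {
    const __m512d round_constant = _mm512_set1_pd(0x1.0p52);  // 2^52
    __m512d d       = _mm512_cvtepi64_pd(v);
    // adding 2^52 pushes the value into the range where doubles are whole numbers,
    // so the FMA's final rounding does the round-to-nearest-integer for us
    __m512d rounded = _mm512_fmadd_pd(d, _mm512_set1_pd(1.0 / 3600), round_constant);
    // fold "subtract round_constant, then scale by 3600" into one FMA;
    // -3600.0 * 2^52 is exactly representable
    __m512d scaled  = _mm512_fmadd_pd(rounded, _mm512_set1_pd(3600.0),
                                      _mm512_set1_pd(-3600.0 * 0x1.0p52));
    return _mm512_cvtpd_epi64(scaled);
}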
There might be a double-rounding problem: first when converting from int64_t to the nearest double (if it's so big that the integer isn't exactly representable), then again when dividing and rounding to the nearest integer.
Costs: I'm assuming you can replace the FP division with a multiply by 1.0/3600.
fp mul, convert, integer mul: vcvtqq2pd (1 uop for the FMA ports) + vmulpd (1 uop) + vcvtpd2qq (1 uop) + vpmullq (3 uops for the FMA ports) = 6 uops for the 2 FMA ports. vpsubq zmm also competes for the same ports, so really it's 7. (SKX uop counts from Agner Fog's testing.)
fp everything: vcvtqq2pd (1 uop for the FMA ports) + vmulpd (1 uop) + vrndscalepd (2 uops) + vmulpd (1 uop) + vcvtpd2qq (1 uop) = 6 uops again, but maybe lower latency (vrndscale + vmulpd is 8 + 4 cycles of latency, vs. vpmullq's 15). But out-of-order exec should easily hide this latency when looping over an array of independent vectors, so saving latency isn't a huge deal.
I'm not sure how efficient you can make an "integer" multiply, or whether FP bithacks could avoid the convert instructions entirely. If this is performance-critical, it might be worth investigating.