adding "-march=native" intel compiler flag to the compilation line leads to a floating point exception on KNL

Question

I have a code, which i launch on Intel Xeon Phi Knights Landing (KNL) 7210 (64 cores) processor (it is a PC, in native mode) and use the Intel c++ compiler (icpc) version 17.0.4. Also i launch the same code on Intel core i7 processor, where the version of icpc is 17.0.1. To be more correct, i compile the code on the machine i'm launching it (compiled on i7 and launched on i7, the same for KNL). I never make the binary file on one machine and bring it to another. The loops are parallelized and vectorized using OpenMP. For best performance i use the intel compiler flags:

-DCMAKE_CXX_COMPILER="-march=native -mtune=native -ipo16 -fp-model fast=2 -O3 -qopt-report=5 -mcmodel=large"

On i7 everything works well. But on KNL the code works withous -march=native and if to add this option the program throws floating point exception immediately. If to compile with the only flag "-march=native" the situation is the same. If to use gdb, it points at the line pp+=alpha/rd of the piece of code:

...

the code above is run in 1 thread

double K1=0.0, P=0.0;

#pragma omp parallel for reduction(+:P_x,P_y,P_z, K1,P)
for(int i=0; i<N; ++i)
{
  P_x+=p[i].vx*p[i].m;
  P_y+=p[i].vy*p[i].m;
  P_z+=p[i].vz*p[i].m;
  K1+=p[i].vx*p[i].vx+p[i].vy*p[i].vy+p[i].vz*p[i].vz;
  float pp=0.0;
#pragma simd reduction(+:pp)
  for(int j=0; j<N; ++j) if(i!=j)
  {
    float rd=sqrt((p[i].x-p[j].x)*(p[i].x-p[j].x)+(p[i].y-p[j].y)*(p[i].y-p[j].y)+(p[i].z-p[j].z)*(p[i].z-p[j].z));
    pp+=alpha/rd;
  }
  P+=pp;
}
...

Particle p[N]; - an array of particles, Particle is a structure of floats. N - maximal number of particles.

If to remove the flag -march=native or replace it with -march=knl or with -march=core-avx2, everything woks OK. This flag is doing something bad to the program, but what - I don't know.

I found in the Internet (https://software.intel.com/en-us/articles/porting-applications-from-knights-corner-to-knights-landing, https://math-linux.com/linux/tip-of-the-day/article/intel-compilation-for-mic-architecture-knl-knights-landing) that one should use the flags: -xMIC-AVX512. I tried to use this flag and -axMIC-AVX512, but they give the same error.

So, what i wanted to ask is:

Why -march=native, -xMIC-AVX512 do not work and -march=knl works; is -xMIC-AVX512 included in -march=native flag for KNL?
May I replace the flag -march=native with -march=knl when I launch the code on KNL (on i7 everything works), are they equivalent?
Is the set of flags written optimal for the best performance if using Intel compiler?

As, Peter Cordes told, i placed here the assembeler output when the program throws Floating Point Exception in GDB: 1) the output of (gdb) disas:

Program received signal SIGFPE, Arithmetic exception.
0x000000000040e3cc in randomizeBodies() ()
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5- 
16.el7.x86_64 libstdc++-4.8.5-16.el7.x86_64
(gdb) disas
Dump of assembler code for function _Z15randomizeBodiesv:
0x000000000040da70 <+0>:    push   %rbp
0x000000000040da71 <+1>:    mov    %rsp,%rbp
0x000000000040da74 <+4>:    and    $0xffffffffffffffc0,%rsp
0x000000000040da78 <+8>:    sub    $0x100,%rsp
0x000000000040da7f <+15>:   vpxor  %xmm0,%xmm0,%xmm0
0x000000000040da83 <+19>:   vmovups %xmm0,(%rsp)
0x000000000040da88 <+24>:   vxorpd %xmm5,%xmm5,%xmm5
0x000000000040da8c <+28>:   vmovq  %xmm0,0x10(%rsp)
0x000000000040da92 <+34>:   mov    $0x77359400,%ecx
0x000000000040da97 <+39>:   xor    %eax,%eax
0x000000000040da99 <+41>:   movabs $0x5deece66d,%rdx
0x000000000040daa3 <+51>:   mov    %ecx,%ecx
0x000000000040daa5 <+53>:   imul   %rdx,%rcx
0x000000000040daa9 <+57>:   add    $0xb,%rcx
0x000000000040daad <+61>:   mov    %ecx,0x9a3b00(,%rax,8)
0x000000000040dab4 <+68>:   mov    %ecx,%esi
0x000000000040dab6 <+70>:   imul   %rdx,%rsi
0x000000000040daba <+74>:   add    $0xb,%rsi
0x000000000040dabe <+78>:   mov    %esi,0x9e3d00(,%rax,8)
0x000000000040dac5 <+85>:   mov    %esi,%edi
0x000000000040dac7 <+87>:   imul   %rdx,%rdi
0x000000000040dacb <+91>:   add    $0xb,%rdi
0x000000000040dacf <+95>:   mov    %edi,0xa23f00(,%rax,8)
0x000000000040dad6 <+102>:  mov    %edi,%r8d
0x000000000040dad9 <+105>:  imul   %rdx,%r8
0x000000000040dadd <+109>:  add    $0xb,%r8
0x000000000040dae1 <+113>:  mov    %r8d,0xa64100(,%rax,8)
0x000000000040dae9 <+121>:  mov    %r8d,%r9d
0x000000000040daec <+124>:  imul   %rdx,%r9
0x000000000040daf0 <+128>:  add    $0xb,%r9
0x000000000040daf4 <+132>:  mov    %r9d,0xaa4300(,%rax,8)
0x000000000040dafc <+140>:  mov    %r9d,%r10d
0x000000000040daff <+143>:  imul   %rdx,%r10
0x000000000040db03 <+147>:  add    $0xb,%r10
0x000000000040db07 <+151>:  mov    %r10d,0x9a3b04(,%rax,8)
0x000000000040db0f <+159>:  mov    %r10d,%r11d
0x000000000040db12 <+162>:  imul   %rdx,%r11
0x000000000040db16 <+166>:  add    $0xb,%r11
0x000000000040db1a <+170>:  mov    %r11d,0x9e3d04(,%rax,8)
0x000000000040db22 <+178>:  mov    %r11d,%ecx
0x000000000040db25 <+181>:  imul   %rdx,%rcx
0x000000000040db29 <+185>:  add    $0xb,%rcx
0x000000000040db2d <+189>:  mov    %ecx,0xa23f04(,%rax,8)

2) the output of p $mxcsr:

(gdb) p $mxcsr
1 = [ ZE PE DAZ DM PM FZ ]

3) the output of p $ymm0.v8_float:

$2 = {3, 3, 3, 3, 3, 3, 3, 3}

4) the output of p $zmm0.v16_float:

gdb) p $zmm0.v16_float
$3 = {3 <repeats 16 times>}.

I shoud also mention that to detect floating point exceptions i used the standard

void handler(int sig)
{
  printf("Floating Point Exception\n");
  exit(0);
}
...
int main(int argc, char **argv)
{
  feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);
  signal(SIGFPE, handler);
  ...
}

I should stress that i have already been using feenableexcept when i got this error. I used it since the begin of program debugging because we had the errors (Floating Point Exceptions) in code and had to correct them.

Are you always compiling on the i7, and running on the KNL? On i7, `-march=native` is the same as compiling with `-march=skylake` or whatever it is. Native means to make code that assumes it's running on the same machine that compiled it, so you shouldn't expect it to work on other machines. — Peter Cordes, Oct 01 '18 at 13:20
If you need fastest executable ever, getting last bit of performance out of the code, you should compile code twice (or whatever number of distinctive platforms you have) - once native for KNL and another one native for i7 — Severin Pappadeux, Oct 01 '18 at 13:22
The code is compiled and run on the same machine: on i7, when we work on i7, and on KNL, when the work is on KNL. I only wanted to say that this flag works on i7 and doesn't work on KNL. Of course, when launching the executable on i7 i compile it on i7 and when launching on KNL - on KNL. — And, Oct 01 '18 at 13:23
Is `rd == 0.0` at that point or something? Do you have FP exceptions unmasked on your KNL system? Different compiler options can produce different FP behaviour (Intel's compiler enabled the equivalent of `-ffast-math` so it's probably using AVX512ER (KNL-only) [VRSQRT28PS](http://felixcloutier.com/x86/VRSQRT28PS.html) to get a highish-precision fast approximation recip sqrt, much better than the `vrsqrt14ps` from plain AVX512, or 12-bit from plain SSE/AVX1 `vrsqrtps`. — Peter Cordes, Oct 01 '18 at 13:23
rd mustn't be ==0.0. It may be small, but not zero. Without "-march=native" everything works=>without "-march=native" rd !=0.0=>what i say is right. — And, Oct 01 '18 at 13:27
Which asm instruction triggers an FP exception, and why? (What values in registers explain it raising an exception? So what values are in registers, and what is MXCSR?) Normally you only ever get SIGFPE on Linux from *integer* division, because FP exceptions are masked. Looking at MXCSR will tell us if FP exceptions are unmasked. — Peter Cordes, Oct 01 '18 at 14:14
asm instruction - assembler instruction, perhaps? Found on the Internet: MXCSR - SIMD Floating-Point Control/Status Register). I check SIGFPE using void handler(int sig) { printf("Floating Point Exception\n"); exit(0); } and feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW); signal(SIGFPE, handler); in main(). But in the parallel regime it is likelly not to work properly. I don't know how to see what values are in registers, don't know how to look at MXCSR and can't read assembler code, excuse me. — And, Oct 01 '18 at 14:23
When your program stops with an exception in GDB, do `disas` to disassemble the instruction that faulted, and `p $mxcsr` to print that control/status register. Also do `p $ymm0.v8_float` or `zmm0.v16_float`, for all the vector registers involved in the instruction that faulted. — Peter Cordes, Oct 02 '18 at 01:08
@And: Were you *already* using `feenableexcept` when you got this error, or did you just add that after my last comment? You need to include that in the question because it's the key to the answer. Intel's compiler with `-fp-model fast=2`, or `gcc -ffast-math`, assumes that FP exceptions are masked so it can cause FE_INVALID in some SIMD elements in some temporary calculations, as long as everything works out in the end (e.g. blend to fix up elements where recip-sqrt went wrong). I'd assume that's what's going on here. — Peter Cordes, Oct 02 '18 at 01:13
@Peter Cordes, I have already been using feenableexcept when i got this error. I used it since the begin of program debugging because we had the errors (Floating Point Exceptions) in code and had to correct them. — And, Oct 02 '18 at 07:45
Oh, I forgot that `disas` would disassemble from the start of the function, instead of at `$rip`. You left out the disassembly of `0x000000000040e3cc`, the instruction where GDB reported the fault. Keep going until you get to the faulting one (and a few beyond, preferably until the next `jcc` (any instruction that starts with j is a branch). Also, was it *actually* `zmm0` that the faulting instruction used? I used 0 as a placeholder, but I meant to print the contents of all the register operands for that instruction. (All the named registers on that line.) — Peter Cordes, Oct 02 '18 at 07:49
Excuse me, I don't fully understand what You are speaking about (i launch gdb, it throws a SIGFPE, what is a disassembly of 0x000000000040e3cc and how to print it, i don't know; whats does it mean : "keep going you get to the faulting one"). Could You be so kind, please, to explain what is needed in more simple words. I'm not familiar with assembler. If it is necessary, i can send You a small code of about 200 lines reproducing the error (have just found that the error may be reproduced by a small part of the whole code). — And, Oct 02 '18 at 08:17

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

You were using feenableexcept to unmask some FP exceptions, so optimizations that create invalid temporary results will crash your program.

Intel's compiler with -fp-model fast=2, like gcc -ffast-math, assumes that FP exceptions are masked so it can cause FE_INVALID in some SIMD elements in some temporary calculations, as long as everything works out in the end (e.g. blend to fix up elements where recip-sqrt went wrong). I'd assume that's what's going on here.

If you post disassembly of the actual instruction that faulted (instead of a bunch of integer multiplies at the very start of that function), we can figure out exactly what optimization caused what invalid temporary, but in general you need to use less aggressive FP options when compiling builds that turn on FP exceptions.

According to Intel's documentation:

-fp-model fast[=1|2] or /fp:fast[=1|2]

Floating-point exception semantics are disabled by default and they cannot be enabled because you cannot specify fast and except together in the same compilation. To enable exception semantics, you must explicitly specify another keyword (see other keyword descriptions for details).

You need to use -fp-model except if you want the compiler to respect the fact that FP exceptions are a visible side-effect. This is not on by default.

If you're going to call functions that modify the FP environment, ISO C says you should use #pragma STDC FENV_ACCESS ON, and that without that, modifications to the FP environment aren't "meaningful". "Otherwise the implementation is free to assume that floating-point control modes are always the default ones and that floating-point status flags are never tested or modified." I'm not sure if enabling exceptions really counts. Probably not important, as long as you're doing it once at program startup, otherwise it would matter whether or not a computation happens before or after enabling exceptions.

Similarly for gcc, -ffast-math includes -fno-trapping-math, which promises the compiler that FP instructions won't raise SIGFPE, just silently set sticky status bits in the MXCSR and produce NaN (invalid), +-Infinity (overflow), or 0.0 (underflow).

"functions that modify the FP environment" - it means throwing FP exceptions that modify the flags of the floating point environment? I tried to write #pragma FENV_ACCESS ON, #pragma STDC FENV_ACCESS ON (https://en.cppreference.com/w/cpp/preprocessor/impl), but the compiler writes: warning #161: unrecognized #pragma. — And, Oct 02 '18 at 09:05
@And: No, `feenableexcept` is a function that modifies the FP environment by changing the exception mask to unmask some exceptions, so computations *after* that call behave differently. I'm not sure if that counts, or if only changing the rounding mode would matter. (You need to stop the compiler from reordering computation across `fesetenv`, because it matters whether something is computed before or after changing the rounding mode.) — Peter Cordes, Oct 02 '18 at 09:09
I had one more question, are flags "-march=native" and "-march=knl" equal if to launch the code on KNL (as i see, they aren't, but why?), do You know? — And, Oct 02 '18 at 09:17
@And: When you're compiling *on* KNL, I think they are equivalent for gcc. I thought ICC would be the same, too, but I'm not sure. — Peter Cordes, Oct 02 '18 at 09:23

adding "-march=native" intel compiler flag to the compilation line leads to a floating point exception on KNL

1 Answers1

Linked