I have a code, which i launch on Intel Xeon Phi Knights Landing (KNL) 7210 (64 cores) processor (it is a PC, in native mode) and use the Intel c++ compiler (icpc) version 17.0.4. Also i launch the same code on Intel core i7 processor, where the version of icpc is 17.0.1. To be more correct, i compile the code on the machine i'm launching it (compiled on i7 and launched on i7, the same for KNL). I never make the binary file on one machine and bring it to another. The loops are parallelized and vectorized using OpenMP. For best performance i use the intel compiler flags:
-DCMAKE_CXX_COMPILER="-march=native -mtune=native -ipo16 -fp-model fast=2 -O3 -qopt-report=5 -mcmodel=large"
On i7 everything works well. But on KNL the code works withous -march=native
and if to add this option the program throws floating point exception immediately. If to compile with the only flag "-march=native" the situation is the same. If to use gdb, it points at the line pp+=alpha/rd
of the piece of code:
...
the code above is run in 1 thread
double K1=0.0, P=0.0;
#pragma omp parallel for reduction(+:P_x,P_y,P_z, K1,P)
for(int i=0; i<N; ++i)
{
P_x+=p[i].vx*p[i].m;
P_y+=p[i].vy*p[i].m;
P_z+=p[i].vz*p[i].m;
K1+=p[i].vx*p[i].vx+p[i].vy*p[i].vy+p[i].vz*p[i].vz;
float pp=0.0;
#pragma simd reduction(+:pp)
for(int j=0; j<N; ++j) if(i!=j)
{
float rd=sqrt((p[i].x-p[j].x)*(p[i].x-p[j].x)+(p[i].y-p[j].y)*(p[i].y-p[j].y)+(p[i].z-p[j].z)*(p[i].z-p[j].z));
pp+=alpha/rd;
}
P+=pp;
}
...
Particle p[N];
- an array of particles, Particle is a structure of floats. N - maximal number of particles.
If to remove the flag -march=native
or replace it with -march=knl
or with -march=core-avx2
, everything woks OK. This flag is doing something bad to the program, but what - I don't know.
I found in the Internet (https://software.intel.com/en-us/articles/porting-applications-from-knights-corner-to-knights-landing, https://math-linux.com/linux/tip-of-the-day/article/intel-compilation-for-mic-architecture-knl-knights-landing) that one should use the flags: -xMIC-AVX512
. I tried to use this flag and -axMIC-AVX512
, but they give the same error.
So, what i wanted to ask is:
Why
-march=native
,-xMIC-AVX512
do not work and-march=knl
works; is-xMIC-AVX512
included in-march=native
flag for KNL?May I replace the flag
-march=native
with-march=knl
when I launch the code on KNL (on i7 everything works), are they equivalent?Is the set of flags written optimal for the best performance if using Intel compiler?
As, Peter Cordes told, i placed here the assembeler output when the program throws Floating Point Exception in GDB: 1) the output of (gdb) disas:
Program received signal SIGFPE, Arithmetic exception.
0x000000000040e3cc in randomizeBodies() ()
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-
16.el7.x86_64 libstdc++-4.8.5-16.el7.x86_64
(gdb) disas
Dump of assembler code for function _Z15randomizeBodiesv:
0x000000000040da70 <+0>: push %rbp
0x000000000040da71 <+1>: mov %rsp,%rbp
0x000000000040da74 <+4>: and $0xffffffffffffffc0,%rsp
0x000000000040da78 <+8>: sub $0x100,%rsp
0x000000000040da7f <+15>: vpxor %xmm0,%xmm0,%xmm0
0x000000000040da83 <+19>: vmovups %xmm0,(%rsp)
0x000000000040da88 <+24>: vxorpd %xmm5,%xmm5,%xmm5
0x000000000040da8c <+28>: vmovq %xmm0,0x10(%rsp)
0x000000000040da92 <+34>: mov $0x77359400,%ecx
0x000000000040da97 <+39>: xor %eax,%eax
0x000000000040da99 <+41>: movabs $0x5deece66d,%rdx
0x000000000040daa3 <+51>: mov %ecx,%ecx
0x000000000040daa5 <+53>: imul %rdx,%rcx
0x000000000040daa9 <+57>: add $0xb,%rcx
0x000000000040daad <+61>: mov %ecx,0x9a3b00(,%rax,8)
0x000000000040dab4 <+68>: mov %ecx,%esi
0x000000000040dab6 <+70>: imul %rdx,%rsi
0x000000000040daba <+74>: add $0xb,%rsi
0x000000000040dabe <+78>: mov %esi,0x9e3d00(,%rax,8)
0x000000000040dac5 <+85>: mov %esi,%edi
0x000000000040dac7 <+87>: imul %rdx,%rdi
0x000000000040dacb <+91>: add $0xb,%rdi
0x000000000040dacf <+95>: mov %edi,0xa23f00(,%rax,8)
0x000000000040dad6 <+102>: mov %edi,%r8d
0x000000000040dad9 <+105>: imul %rdx,%r8
0x000000000040dadd <+109>: add $0xb,%r8
0x000000000040dae1 <+113>: mov %r8d,0xa64100(,%rax,8)
0x000000000040dae9 <+121>: mov %r8d,%r9d
0x000000000040daec <+124>: imul %rdx,%r9
0x000000000040daf0 <+128>: add $0xb,%r9
0x000000000040daf4 <+132>: mov %r9d,0xaa4300(,%rax,8)
0x000000000040dafc <+140>: mov %r9d,%r10d
0x000000000040daff <+143>: imul %rdx,%r10
0x000000000040db03 <+147>: add $0xb,%r10
0x000000000040db07 <+151>: mov %r10d,0x9a3b04(,%rax,8)
0x000000000040db0f <+159>: mov %r10d,%r11d
0x000000000040db12 <+162>: imul %rdx,%r11
0x000000000040db16 <+166>: add $0xb,%r11
0x000000000040db1a <+170>: mov %r11d,0x9e3d04(,%rax,8)
0x000000000040db22 <+178>: mov %r11d,%ecx
0x000000000040db25 <+181>: imul %rdx,%rcx
0x000000000040db29 <+185>: add $0xb,%rcx
0x000000000040db2d <+189>: mov %ecx,0xa23f04(,%rax,8)
2) the output of p $mxcsr:
(gdb) p $mxcsr
1 = [ ZE PE DAZ DM PM FZ ]
3) the output of p $ymm0.v8_float:
$2 = {3, 3, 3, 3, 3, 3, 3, 3}
4) the output of p $zmm0.v16_float:
gdb) p $zmm0.v16_float
$3 = {3 <repeats 16 times>}.
I shoud also mention that to detect floating point exceptions i used the standard
void handler(int sig)
{
printf("Floating Point Exception\n");
exit(0);
}
...
int main(int argc, char **argv)
{
feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);
signal(SIGFPE, handler);
...
}
I should stress that i have already been using feenableexcept when i got this error. I used it since the begin of program debugging because we had the errors (Floating Point Exceptions) in code and had to correct them.