
When I use the icc compiler on macOS, I cannot obtain the same answer as with other compilers such as gcc and clang. With icc, the result was:

0.000000e+00
0.000000e+00
0.000000e+00
0.000000e+00
0.000000e+00
0.000000e+00
0.000000e+00
0.000000e+00

The expected answer is:

1.000000e+00
2.000000e+00
3.000000e+00
4.000000e+00
2.500000e+01
3.000000e+01
3.500000e+01
4.000000e+01

I compiled like this:

  • icc: icc test1.c -fopenmp -mavx -Wall
  • gcc: gcc test1.c -fopenmp -mavx -Wall
  • clang: clang test1.c -fopenmp -mavx -Wall

My code is as follows:

#include "stdio.h"
#include "time.h"
#include "math.h"
#include "stdlib.h"
#include "omp.h"
#include "x86intrin.h"

void dd_m_dd(double *ahi, double *bhi, double *chi, int m, int n)
{

    int j;
    #pragma omp parallel
    {
        __m256d vahi,vbhi,vchi;
        #pragma omp for private(vahi,vbhi,vchi)
        for (j = 0; j < m*n; j+=4) {

            vbhi = _mm256_broadcast_sd(&bhi[j]);
            vahi = _mm256_load_pd(&ahi[j]);
            vchi = _mm256_load_pd(&chi[j]);

            vchi=vahi*vbhi;

            chi[j]=vchi[0];
            chi[j+1]=vchi[1];
            chi[j+2]=vchi[2];
            chi[j+3]=vchi[3];

        }
    }
}

int main(int argc, const char * argv[]){
    // Matrix Vector Product with DD

    // set variables
    int m;
    double* xhi;
    double* yhi;
    double* z;
    int i;

    m=(int)pow(2,3);
    // main program

    // set vector or matrix
    xhi=(double *)malloc(sizeof(double) * m*1);
    yhi=(double *)malloc(sizeof(double) * m*1);
    z=(double *)malloc(sizeof(double) * m*1);
    //preset
    for (i=0;i<m;i++) {
        xhi[i]=i+1;
        yhi[i]=i+1;
        z[i]=0;
    }

    dd_m_dd(xhi,yhi,z,m,1);

    for (i=0;i<m;i++) {
        printf("%e\n",z[i]);
    }

    free(xhi);
    free(yhi);
    free(z);
    return 0;
}

What is happening here?

    Going off memory of old cppcon videos here, but I think icc has `-ffast-math` on by default. I don't know if that could come into play for your example, but probably worth testing. – Stephen Newell May 14 '18 at 05:59
  • I'd suggest using `-march=native`, not just `-mavx`, to tune for your target machine instead of just enabling AVX while tuning for generic. (Especially with gcc). – Peter Cordes May 14 '18 at 14:51
  • `vchi = _mm256_load_pd(&chi[j]);` is immediately overwritten with `vchi=vahi*vbhi;` That looks wrong; did you mean to add with `+=` instead of assign? – Peter Cordes May 14 '18 at 14:53

1 Answer


I'm not used to vector intrinsics, but this looks very suspicious to me:

    chi[j]=vchi[0];
    chi[j+1]=vchi[1];
    chi[j+2]=vchi[2];
    chi[j+3]=vchi[3];

And as a matter of fact, replacing it with what looks very much like the right function for the job, namely _mm256_store_pd(), seems to fix the issue.

Your function could now look like this (with a few stylistic fixes as well):

void dd_m_dd(double *ahi, double *bhi, double *chi, int m, int n) {

    #pragma omp parallel for
    for (int j = 0; j < m*n; j+=4) {

        __m256d vbhi = _mm256_broadcast_sd(&bhi[j]); // broadcast bhi[j] to all 4 lanes
        __m256d vahi = _mm256_load_pd(&ahi[j]);      // aligned load of 4 doubles

        __m256d vchi = vahi*vbhi;                    // element-wise multiply

        _mm256_store_pd(&chi[j], vchi);              // store all 4 results at once
    }
}
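
As a side note, _mm256_load_pd() and _mm256_store_pd() both assume 32-byte-aligned pointers, and nothing in the function's interface guarantees that (the allocation fix below takes care of it). If you cannot control how the buffers are allocated, a possible alternative is to use the unaligned intrinsics instead; here is a minimal sketch of that variant, where the dd_m_dd_unaligned name is just for illustration:

// Unaligned-safe variant: _mm256_loadu_pd/_mm256_storeu_pd tolerate any
// pointer alignment, at a possible small speed cost on some CPUs.
void dd_m_dd_unaligned(double *ahi, double *bhi, double *chi, int m, int n) {

    #pragma omp parallel for
    for (int j = 0; j < m*n; j+=4) {

        __m256d vbhi = _mm256_broadcast_sd(&bhi[j]);
        __m256d vahi = _mm256_loadu_pd(&ahi[j]);  // unaligned load

        __m256d vchi = _mm256_mul_pd(vahi, vbhi); // portable spelling of vahi*vbhi

        _mm256_storeu_pd(&chi[j], vchi);          // unaligned store
    }
}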

Another issue is that you do not enforce the proper alignment of your pointers: malloc() makes no 32-byte alignment guarantee, so the aligned load/store intrinsics may fault. Rewriting the allocations like this fixes it:

double *xhi=(double *)aligned_alloc(32, sizeof(double) * m*1);
double *yhi=(double *)aligned_alloc(32, sizeof(double) * m*1);
double *z=(double *)aligned_alloc(32, sizeof(double) * m*1);

(AVX needs 32-byte, i.e. 256-bit, alignment; note that C11 aligned_alloc() also expects the requested size to be a multiple of the alignment, which holds here since m is 8.)
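
If aligned_alloc() is not available in your toolchain (it is a C11 function that older macOS SDKs may not ship), posix_memalign() achieves the same; a minimal sketch, assuming these lines replace the three malloc() calls in main():

// posix_memalign() fallback: writes a 32-byte-aligned pointer into its
// first argument and returns 0 on success, non-zero on failure.
if (posix_memalign((void **)&xhi, 32, sizeof(double) * m) != 0) return 1;
if (posix_memalign((void **)&yhi, 32, sizeof(double) * m) != 0) return 1;
if (posix_memalign((void **)&z,   32, sizeof(double) * m) != 0) return 1;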
  • might as well do `vchi = _mm256_mul_pd(vahi,vbhi)` – Z boson May 14 '18 at 10:19
  • @Zboson Yeah, I was wondering about that while writing my answer, but since I'm pretty green with these things, I preferred to stick to the OP's original method. Now, out of curiosity, I did try `*` vs. `_mm256_mul_pd()` and turns out that the compiler generated exactly the same code with both (which doesn't really surprise me TBH). Now which of the 2 is better, IDK... – Gilles May 14 '18 at 14:17
  • @Zboson: Compilers that implement GNU C extensions define `__m256d` in terms of GNU C native vectors, so `__m256d vchi=vahi*vbhi;` is well-defined as a vertical SIMD multiply (`vmulpd`). And BTW, Gilles: we don't know `chi` is aligned, so `_mm256_storeu_pd` should be used. But again, GNU C native vector extensions define `vchi[1]` syntax to do what the OP expects. (No guarantee that it compiles *efficiently* with the 4 elements done with one `vmovupd`, but this doesn't explain a correctness problem.) – Peter Cordes May 14 '18 at 14:43
  • re: which is better? The `*` operator is more readable, but not portable to MSVC unless you use a wrapper library like Agner Fog's VCL. It also will get you in trouble with integer vectors, because those are defined in terms of `long long`, so you get `paddq` or whatever, regardless of whether it came from `set1_epi8` or `set1_epi64`. Agner Fog's VCL has different types for different element widths of integer vectors, so you can use `v1 + v2` and `v1 * v2` there. http://agner.org/optimize/#vectorclass. – Peter Cordes May 14 '18 at 14:47
  • Actually, maybe ICC doesn't like `vchi[0]`. Getting all-zero as the result is more than just `-ffast-math` rounding error. With the per-element stores, ICC's asm output doesn't include `vmul` at all, so it's not doing any FP multiplies! https://godbolt.org/g/nTi4pE With `_mm256_storeu_pd`, there's a `vmulpd`. – Peter Cordes May 14 '18 at 14:58
  • @PeterCordes, ICC only has the problem with OpenMP https://godbolt.org/g/rCu6Pj. It looks like a bug. – Z boson May 15 '18 at 08:24
  • `*(__m256d*)chi = vchi;` does not have the problem (though it assumes `chi` is 32 byte aligned) https://godbolt.org/g/2LqCwe. – Z boson May 15 '18 at 08:36
  • @PeterCordes, personally I try to use all vector extensions or all intrinsics when possible. In the OP's case only the broadcast intrinsic is necessary with ICC (but not with GCC or Clang https://stackoverflow.com/a/43801280/2542702) if the pointers are 32 byte aligned. – Z boson May 15 '18 at 08:40