segmentation fault for `vmovaps'

Question

I wrote code to add two arrays using KNC instructions (512-bit vectors) on an Intel Xeon Phi coprocessor. However, I get a segmentation fault in the inline assembly part.

Here is my code:

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length / 16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
            A[i] = 1;
            B[i] = 2;
    }

    float * pA = A;
    float * pB = B;
    float * pC = C;
    for(i=0; i<AVXLength; i++ ){
         __asm__("vmovaps %1,%%zmm0\n"
                    "vmovaps %2,%%zmm1\n"
                    "vaddps %%zmm0,%%zmm0,%%zmm1\n"
                    "vmovaps %%zmm0,%0;"
            : "=m" (pC) : "m" (pA), "m" (pB));

            pA += 512;
            pB += 512;
            pC += 512;
    }
    return 0;
}

I am using GCC as the compiler (because I cannot afford the Intel compiler). This is the command line I use to compile the code:

k1om-mpss-linux-gcc add.c -o add.out

The problem was in the inline assembly. The following inline assembly fixed it.

__asm__("vmovaps %1,%%zmm1\n"
        "vmovaps %2,%%zmm2\n"
        "vaddps %%zmm1,%%zmm2,%%zmm3\n"
        "vmovaps %%zmm3,%0;"
        : "=m" (*pC) : "m" (*pA), "m" (*pB));

There are a lot of compile switches that control floating point. And some that control avx. Can you tell us which ones you are using? — David Wohlferd, Dec 08 '15 at 05:59
When you posted [an almost identical question recently](http://stackoverflow.com/questions/34114092/vector-sum-using-avx-inline-assembly-on-xeonphi) it was [pointed out to you that the first generation Xeon Phi (Knight's Corner) does not support AVX](http://stackoverflow.com/a/34115089/253056). — Paul R, Dec 08 '15 at 11:24
For `KNCI` you would need to be using the `zmm` registers with `vaddps`. Note also that intrinsics are *much* easier to use for this than raw inline asm (as also noted in comments on your previous question). See also [this very relevant answer](http://stackoverflow.com/a/22719429/253056). — Paul R, Dec 08 '15 at 11:30
I posted a question, but nobody answered it. KNC does not support AVX512 but it should support AVX256 — Hamid_UMB, Dec 08 '15 at 18:40
You should not apply vector floating point addition to `pA`, `pB` and `pC` as they are pointers. `*pA` etc., whilst closer to the correct type, are still too small; you should use the appropriate vector type. — Timothy Baldwin, Dec 08 '15 at 22:18
@PaulR, in this case the mnemonics for KNC and AVX512 are exactly the same. I suspect AVX could be used with `ymm` instead of `zmm` and it would work fine on KNC (the OP has since modified the code to use zmm instead, but the essence of the question has not changed). It's just a matter of finding a compiler/assembler that generates the correct opcodes. The OP is using a special version of GCC which does this. — Z boson, Dec 12 '15 at 17:26
@PaulR, I think I am wrong about `ymm`. The intel documentation is clear that KNC does not support "Instructions that operate on YMM registers" (or XMM or MMX for that matter). So it's some partially overlapping subset of AVX512 only. — Z boson, Dec 28 '15 at 20:45
@Zboson: yes, it's been a while since I did any work on KNC but that sounds about right - KNL and beyond is a whole different story. — Paul R, Dec 29 '15 at 08:41
@PaulR, you were disappointed with KNC? I am still considering getting one if it does not cost too much. I think it will still beat any desktop AVX512 processor which comes out maybe in a year. — Z boson, Dec 29 '15 at 10:22
@Zboson: yes, for the sort of workloads I'm interested in KNC was not particularly impressive, KNL looks much more promising, and of course eventually we will have Purley with AVX-512 (2017 ?). — Paul R, Dec 29 '15 at 10:45
@PaulR, KNL looks more promising until you consider the price. But I think KNC/price makes it much more interesting. But the cheap KNC cards are passively cooled and only work for some motherboards so it could be a lot of hassle to get a working cheap system. And it seems like the Intel compiler is almost necessary. — Z boson, Dec 29 '15 at 11:10
@Zboson: KNC is something of a dead-end though, in that if you spend time learning its architecture and somewhat unique SIMD ISA this will probably not be particularly useful for any other subsequent Xeon Phi architecture. Which is fine if you just want something to play with, or if you have a particular project for which this makes sense in the short term - otherwise I think I'd just play with AVX-512 on the Intel sde while waiting for KNL and/or Purley to arrive. — Paul R, Dec 29 '15 at 13:52
@PaulR, I don't think we will see Purley for more than a year and it's not going to be cheap. I would expect 500 USD for four cores with AVX512 compared to 200 USD for 51 cores and something close to AVX512. But the big question is why were you disappointed in KNC? What card did you use? — Z boson, Dec 29 '15 at 14:19
@Zboson: main disappointment with KNC was that it was pretty much impossible to get anywhere near theoretical throughput due to the limitations of the rather bare-bones cores - poor SIMD integer/fixed point support too IIRC. — Paul R, Dec 29 '15 at 17:21

score 4 · Accepted Answer · edited May 23 '17 at 12:31

As already explained, Knights Corner (KNC) does not have AVX512. However, it does have something similar. It turns out that the KNC vs AVX512 issue is a red herring here. The problem is in the OP's inline assembly.

Instead of using inline assembly, I suggest you use intrinsics. The KNC intrinsics are described in the online Intel Intrinsics Guide.

Additionally, Przemysław Karpiński at CERN extended Agner Fog's Vector Class Library to use KNC. You can find the git repository here. If you look in the file vectorf512_mic.h you can learn a lot about the KNC intrinsics.

I converted your code to use these intrinsics (which turn out in this case to be the same as the AVX512 intrinsics):

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length /16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    for(i=0; i<AVXLength; i++ ){
        __m512 a16 = _mm512_load_ps(&A[16*i]);
        __m512 b16 = _mm512_load_ps(&B[16*i]);
        __m512 s16 = _mm512_add_ps(a16,b16);
        _mm512_store_ps(&C[16*i], s16);
    }
    return 0;
}
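The listing calls `aligned_malloc`, which is not a standard C function, so it presumably comes from a helper that is not shown. Below is a minimal sketch of what such a helper could look like, assuming a C11 library that provides `aligned_alloc`. The names `aligned_malloc_sketch`, `is_aligned64` and `add_scalar` are made up for illustration; the scalar loop is only a reference for checking the vector result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the aligned_malloc() called in the listing above.
   Uses C11 aligned_alloc(), which expects the size to be a multiple of the
   alignment, so the size is rounded up first. Release the block with free(). */
static void *aligned_malloc_sketch(size_t size, size_t alignment)
{
    /* The alignment must be a non-zero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}

/* Returns 1 if p sits on a 64-byte boundary, which aligned 512-bit loads
   and stores require. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}

/* Scalar reference for the vector loop: c[i] = a[i] + b[i]. */
static void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Comparing the output of the vector loop against `add_scalar` on the same inputs is a cheap way to confirm that the loads, the add and the stores all touch the intended elements.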

The KNC intrinsics are only supported by ICC. However, KNC comes with the Manycore Platform Software Stack (MPSS), which includes a special version of gcc, k1om-mpss-linux-gcc, that can use the AVX512-like features of KNC through inline assembly.

The mnemonics for KNC and AVX512 are the same in this case. Therefore we can use AVX512 intrinsics to discover the assembly to use:

#include <immintrin.h>

/* Float version, matching the vmovaps/vaddps output shown below. */
void foo(float *A, float *B, float *C) {
    __m512 a16 = _mm512_load_ps(A);
    __m512 b16 = _mm512_load_ps(B);
    __m512 s16 = _mm512_add_ps(a16,b16);
    _mm512_store_ps(C, s16);
}

and gcc -O3 -mavx512f knc.c produces

vmovaps (%rdi), %zmm0
vaddps  (%rsi), %zmm0, %zmm0
vmovaps %zmm0, (%rdx)

From this, one solution using inline assembly would be:

__asm__("vmovaps   (%1), %%zmm0\n"
        "vaddps    (%2), %%zmm0, %%zmm0\n"
        "vmovaps   %%zmm0, (%0)"
        :
        : "r" (pC), "r" (pA), "r" (pB)
        : "memory"   /* the asm reads and writes memory through the pointers */
);

With the previous code, GCC generates a separate add instruction to advance each of the three pointers. Here is a better solution using an index register, which needs only one add per iteration:

for(i=0; i<length; i+=16){
    __asm__ __volatile__ (
            "vmovaps   (%1,%3,4), %%zmm0\n"
            "vaddps    (%2,%3,4), %%zmm0, %%zmm0\n"
            "vmovaps   %%zmm0, (%0,%3,4)"
            :
            : "r" (C), "r" (A), "r" (B), "r" ((long)i)  /* index must be in a 64-bit register */
            : "memory"
    );
}
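To make the difference between the two loop shapes concrete, here is a portable scalar sketch of both. It uses no KNC instructions, and the function names and the `LANES` constant are made up for illustration. In both variants one iteration covers 16 floats (64 bytes, one 512-bit vector), so a `float *` advances by 16 elements per step, and the addressing mode `(%1,%3,4)` corresponds to `&A[i]`, that is, base plus `i * 4` bytes.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 16 };  /* floats per 512-bit vector */

/* Variant 1: three pointers, each bumped by one vector per iteration.
   Returns the number of iterations executed. */
static size_t add_by_pointers(const float *a, const float *b, float *c, size_t n)
{
    assert(n % LANES == 0);
    const float *pa = a;
    const float *pb = b;
    float *pc = c;
    size_t iterations = 0;
    for (size_t done = 0; done < n; done += LANES) {
        for (size_t lane = 0; lane < LANES; lane++)
            pc[lane] = pa[lane] + pb[lane];
        pa += LANES;  /* three separate pointer updates */
        pb += LANES;
        pc += LANES;
        iterations++;
    }
    return iterations;
}

/* Variant 2: one shared index, as in the indexed inline assembly above.
   Only `i` changes between iterations. */
static size_t add_by_index(const float *a, const float *b, float *c, size_t n)
{
    assert(n % LANES == 0);
    size_t iterations = 0;
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
        iterations++;
    }
    return iterations;
}
```

Both variants compute the same result in `n / 16` iterations; the indexed form simply keeps the loop bookkeeping down to a single counter.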

The latest version of the MPSS (3.6) includes GCC 5.1.1, which supports AVX512 intrinsics. So I think you can use AVX512 intrinsics wherever they are the same as the KNC intrinsics, and use inline assembly only where they differ. The Intel Intrinsics Guide shows that they are the same in most cases.
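One way to structure "intrinsics where available, something else otherwise" is a compile-time switch. The sketch below assumes a compiler that defines the `__AVX512F__` macro when AVX-512 code generation is enabled, as mainline GCC does with `-mavx512f`. Whether the MPSS toolchain defines it when targeting KNC is an assumption that has not been checked here. The function names are made up for illustration, and the fallback is a plain scalar loop standing in for an inline-assembly version.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Adds n floats: c[i] = a[i] + b[i]. n must be a multiple of 16. In the
   intrinsics path, a, b and c must also be 64-byte aligned. */
static void add_f32(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 16 == 0);
#if defined(__AVX512F__)
    /* Intrinsics path: one 512-bit vector (16 floats) per iteration. */
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_load_ps(&a[i]);
        __m512 vb = _mm512_load_ps(&b[i]);
        _mm512_store_ps(&c[i], _mm512_add_ps(va, vb));
    }
#else
    /* Fallback: scalar loop; an inline-assembly version could go here. */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
#endif
}

/* Reports which path was compiled in: 1 for intrinsics, 0 for the fallback. */
static int add_f32_uses_intrinsics(void)
{
#if defined(__AVX512F__)
    return 1;
#else
    return 0;
#endif
}
```

Keeping both paths behind one function signature means the calling code does not change when the toolchain or target changes.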

Thank you for your answer. My problem is that I don't have the Intel compiler and I want to use GCC, so I cannot use intrinsics. Can you tell me how I can replace the intrinsics with inline assembly to use GCC? — Hamid_UMB, Dec 11 '15 at 23:16
@Hamid_UMB, if you fixed your problem could you please update your question with the solution? Or could you provide an answer to [my question](http://stackoverflow.com/questions/26933394/xeon-phi-knights-corner-intrinsics-with-gcc) with an example of your code and the instructions you used to get it working on KNC. I don't own a KNC but if I were to get one it would be nice to know how to do this. — Z boson, Dec 12 '15 at 12:08
@Hamid_UMB, I just noticed you edited your answer. I assume you added the working solution. You should have appended the working solution to your question, not overwritten it. Anybody coming to your question now and reading it for the first time may think it's still broken. — Z boson, Dec 12 '15 at 17:13
@Hamid_UMB, you also edited your original code which only used 256-bit vectors. This is not good to do because then my answer which was written with that code has to change. I rolled back to your last edit and then pasted the code you used to fix it. — Z boson, Dec 12 '15 at 17:20
@Hamid_UMB, I added a solution using inline assembly for your main loop which should be more efficient. — Z boson, Dec 12 '15 at 19:15
@Hamid_UMB, I just downloaded `mpss-3.6` and it uses GCC 5.1.1 which supports AVX512 intrinsics so I think you can just use the AVX512 intrinsics. Check the documentation for the cases when AVX512 and KNC have the same intrinsics. — Z boson, Dec 13 '15 at 10:19
