
I am looking for the inline assembly equivalent of the add-reduce operation for Xeon Phi. I found the _mm512_reduce_add_epi32 intrinsic on the Intel intrinsics website (link). However, the website does not list the actual assembly instructions for it.

Can anybody help me find the inline assembly for the reduction operation on the Xeon Phi platform?

Thanks

Hamid_UMB
  • Run this intrinsic through the compiler with -S. – Jeff Hammond Dec 23 '15 at 04:14
  • @Jeff: I don't have the Intel compiler, so I cannot run the intrinsics. The only option I have is to write code with inline assembly. If possible, please run the code with -S and give me the results. – Hamid_UMB Dec 23 '15 at 04:50
  • GCC supports intrinsics. – Jeff Hammond Dec 23 '15 at 04:51
  • @Jeff: Are you sure that GCC supports intrinsics for KNC??? – Hamid_UMB Dec 23 '15 at 04:52
  • Facepalm. Sorry. I forgot that KNC is different. I'll try to remember to send you asm later. – Jeff Hammond Dec 23 '15 at 04:53
  • ISPC (https://github.com/ispc/ispc) supports KNC (see e.g. https://github.com/ispc/ispc/blob/master/examples/intrinsics/knc-i1x16.h). You might try that. – Jeff Hammond Dec 23 '15 at 05:45
  • The reason Intel does not list the instruction is because this is one of those annoying compound-instruction intrinsics which requires several instructions. You can implement this yourself in log2(N) operations. I mean add the high 256 bits to the low 256 bits, add the high 128 bits to the low 128 bits, add the high 64 bits to the low 64 bits, add the high 32 bits to the low 32 bits (see the scalar sketch after these comments). – Z boson Dec 27 '15 at 19:10
  • Can you please give the accepted answer back to @Gilles. I would not have been able to create my answer without his answer and I think his answer answered your question. – Z boson Jan 02 '16 at 09:01
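For reference, the halving scheme described in that comment looks like this in plain scalar C (an illustrative sketch; the function name is mine):

int reduce_add_16(const int x[16]) {
    int t[16];
    for (int i = 0; i < 16; i++) t[i] = x[i];
    for (int width = 8; width >= 1; width /= 2)   // 16 -> 8 -> 4 -> 2 -> 1 partial sums
        for (int i = 0; i < width; i++)
            t[i] += t[i + width];                 // add the high half to the low half
    return t[0];
}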

2 Answers


I know close to nothing when it comes to reading assembly, so I simply did this:

Created a foo.c file like this:

#include "immintrin.h"

int foo(__m512i a) {
    return _mm512_reduce_add_epi32(a);
}

Which I compiled with the Intel compiler version 16.0.1 using -mmic -S. It gave me the following assembly code:

# -- Begin  foo
    .text
# mark_begin;
# Threads 4
        .align    16,0x90
    .globl foo
# --- foo(__m512i)
foo:
# parameter 1: %zmm0
..B1.1:                         # Preds ..B1.0 Latency 53
    .cfi_startproc
..___tag_value_foo.1:
..L2:
                                                          #3.20
        movl      $1, %eax                                      #4.12 c1
        vpermf32x4 $238, %zmm0, %zmm1                           #4.12 c5
        kmov      %eax, %k1                                     #4.12 c5
        vpaddd    %zmm0, %zmm1, %zmm3                           #4.12 c9
        nop                                                     #4.12 c13
        vpermf32x4 $85, %zmm3, %zmm2                            #4.12 c17
        vpaddd    %zmm3, %zmm2, %zmm4                           #4.12 c21
        nop                                                     #4.12 c25
        vpaddd    %zmm4{badc}, %zmm4, %zmm5                     #4.12 c29
        nop                                                     #4.12 c33
        vpaddd    %zmm5{cdab}, %zmm5, %zmm6                     #4.12 c37
        nop                                                     #4.12 c41
        vpackstorelps %zmm6, -8(%rsp){%k1}                      #4.12 c45
        movl      -8(%rsp), %eax                                #4.12 c49
        ret                                                     #4.12 c53
        .align    16,0x90
    .cfi_endproc
                                # LOE
# mark_end;
    .type   foo,@function
    .size   foo,.-foo
    .data
# -- End  foo
    .data
    .section .note.GNU-stack, ""
// -- Begin DWARF2 SEGMENT .eh_frame
    .section .eh_frame,"a",@progbits
.eh_frame_seg:
    .align 8
# End

I guess you should be able to find your way in that...
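If, like the OP, you only have GCC, one option is to wrap that sequence in extended inline assembly, since k1om-mpss-linux-gcc lacks the KNC vector intrinsics. Below is an untested sketch; it assumes the k1om assembler accepts the vpermf32x4/{badc}/{cdab} syntax shown above, that zmm register names are valid clobbers, and that in points to 16 ints aligned to 64 bytes. The helper name and the round-trip through memory are my choices, not compiler output.

#include <stdint.h>

static inline int32_t knc_reduce_add_epi32(const int32_t *in)
{
    int32_t out[16] __attribute__((aligned(64)));
    __asm__ volatile (
        "vmovdqa32  (%[in]), %%zmm0\n\t"               /* load 16 ints */
        "vpermf32x4 $238, %%zmm0, %%zmm1\n\t"          /* lanes (3,2,3,2) */
        "vpaddd     %%zmm0, %%zmm1, %%zmm1\n\t"        /* add high 256 bits to low 256 bits */
        "vpermf32x4 $85, %%zmm1, %%zmm2\n\t"           /* broadcast 128-bit lane 1 */
        "vpaddd     %%zmm1, %%zmm2, %%zmm1\n\t"        /* add high 128 bits to low 128 bits */
        "vpaddd     %%zmm1{badc}, %%zmm1, %%zmm1\n\t"  /* add within lanes, 64-bit step */
        "vpaddd     %%zmm1{cdab}, %%zmm1, %%zmm1\n\t"  /* add within lanes, 32-bit step */
        "vmovdqa32  %%zmm1, (%[out])\n\t"              /* element 0 now holds the sum */
        :
        : [in] "r" (in), [out] "r" (out)
        : "zmm0", "zmm1", "zmm2", "memory");
    return out[0];
}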

Gilles
  • `vpermf32x4` and the `{abcd}` swizzle notation are unique to KNC. I wonder if `k1om-mpss-linux-gcc` implements them. – Z boson Dec 27 '15 at 19:07
  • If you can, would you mind posting the result for AVX512? It would be interesting to see how Intel implements this with AVX512 compared to KNC. – Z boson Dec 29 '15 at 09:17
  • I'm sorry the OP switched the accepted answer to mine. That was not my intention. I asked the OP to switch it back but it did not happen. – Z boson Jan 25 '16 at 12:36
  • @Zboson Don't worry, I'm cool with that, absolutely no problem. – Gilles Jan 25 '16 at 12:42

Doing a reduction of 16 integers with KNC is an interesting case that shows how it differs from AVX512.

The _mm512_reduce_add_epi32 intrinsic is (currently) only supported by the Intel compiler. It's one of those annoying compound intrinsics that expand to several instructions, like the ones in SVML. But I think I understand why Intel implemented this intrinsic in this case: the results for KNC and AVX512 are very different.

With AVX512 I would do something like this:

__m256i hi8   = _mm512_extracti64x4_epi64(a,1);  // upper 8 ints
__m256i lo8   = _mm512_castsi512_si256(a);       // lower 8 ints
__m256i vsum1 = _mm256_add_epi32(hi8,lo8);       // 16 partial sums -> 8

and then I would do a reduction just like in AVX2:

__m256i vsum2 = _mm256_hadd_epi32(vsum1,vsum1);    // pairwise sums within each lane
__m256i vsum3 = _mm256_hadd_epi32(vsum2,vsum2);    // each lane now holds its own sum
__m128i hi4   = _mm256_extracti128_si256(vsum3,1); // upper lane sum
__m128i lo4   = _mm256_castsi256_si128(vsum3);     // lower lane sum
__m128i vsum4 = _mm_add_epi32(hi4, lo4);           // combine the two lane sums
int sum = _mm_cvtsi128_si32(vsum4);                // extract the low 32 bits
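Putting those two pieces together as one function (a sketch; the function name is mine):

#include <immintrin.h>

int reduce_add_epi32_avx512(__m512i a) {
    __m256i hi8   = _mm512_extracti64x4_epi64(a,1);
    __m256i lo8   = _mm512_castsi512_si256(a);
    __m256i vsum1 = _mm256_add_epi32(hi8,lo8);
    __m256i vsum2 = _mm256_hadd_epi32(vsum1,vsum1);
    __m256i vsum3 = _mm256_hadd_epi32(vsum2,vsum2);
    __m128i hi4   = _mm256_extracti128_si256(vsum3,1);
    __m128i lo4   = _mm256_castsi256_si128(vsum3);
    __m128i vsum4 = _mm_add_epi32(hi4, lo4);
    return _mm_cvtsi128_si32(vsum4);
}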

It would be interesting to see how Intel implements _mm512_reduce_add_epi32 with AVX512.

But the KNC instruction set does not support AVX or SSE, so on KNC everything has to be done with the full 512-bit vectors. Intel created instructions unique to KNC to do this.

Looking at the assembly from Gilles' answer we can see what it does. First it permutes the upper 256 bits into the lower 256 bits using an instruction unique to KNC:

vpermf32x4 $238, %zmm0, %zmm1

The value 238 is 3232 in base 4 (3·64 + 2·16 + 3·4 + 2 = 238). So zmm1, in terms of the four 128-bit lanes, is (3,2,3,2).

Next it does a vector sum:

vpaddd    %zmm0, %zmm1, %zmm3

which gives the four 128-bit lanes (3+3, 2+2, 3+1, 2+0).

Then it permutes the second 128-bit lane, giving (3+1, 3+1, 3+1, 3+1):

vpermf32x4 $85, %zmm3, %zmm2

where 85 is 1111 in base 4 (1·64 + 1·16 + 1·4 + 1 = 85), so every lane gets lane 1. Then it adds these together:

vpaddd    %zmm3, %zmm2, %zmm4 

so that the lower 128-bit lane in zmm4 contains the sum of the four 128-bit lanes (3+2+1+0).
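As a sanity check, here is a scalar trace of that lane arithmetic (an illustrative sketch; each int stands in for a whole 128-bit lane):

#include <stdio.h>

int main(void) {
    int zmm0[4] = {0, 1, 2, 3};                                // lane i holds "i"
    int zmm1[4] = {zmm0[2], zmm0[3], zmm0[2], zmm0[3]};        // vpermf32x4 $238: lanes (3,2,3,2)
    int zmm2[4], zmm3[4], zmm4[4];
    for (int i = 0; i < 4; i++) zmm3[i] = zmm0[i] + zmm1[i];   // (2+0, 3+1, 2+2, 3+3)
    for (int i = 0; i < 4; i++) zmm2[i] = zmm3[1];             // vpermf32x4 $85: all lanes = lane 1
    for (int i = 0; i < 4; i++) zmm4[i] = zmm3[i] + zmm2[i];   // lane 0 now holds 3+2+1+0
    printf("%d\n", zmm4[0]);                                   // prints 6
    return 0;
}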

At this point it needs to permute the 32-bit values within each 128-bit lane. Again it uses a unique feature of KNC which allows it to permute and add at the same time (or at least the notation is unique).

vpaddd    %zmm4{badc}, %zmm4, %zmm5 

produces (b+d, a+c, b+d, a+c): writing each lane's elements high-to-low as (d,c,b,a), the {badc} swizzle swaps the two 64-bit halves of the lane,

and

vpaddd    %zmm5{cdab}, %zmm5, %zmm6

produces (a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d), since the {cdab} swizzle swaps adjacent 32-bit elements. Now it is just a matter of extracting the lower 32 bits.
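The same kind of check for the within-lane rounds (an illustrative sketch over one 128-bit lane, elements low-to-high a,b,c,d):

#include <stdio.h>

int main(void) {
    int v[4] = {1, 2, 3, 4};                               // a=1, b=2, c=3, d=4
    int s1[4], s2[4];
    for (int i = 0; i < 4; i++) s1[i] = v[i] + v[i ^ 2];   // {badc}: add the element two away
    for (int i = 0; i < 4; i++) s2[i] = s1[i] + s1[i ^ 1]; // {cdab}: add the adjacent element
    printf("%d %d %d %d\n", s2[0], s2[1], s2[2], s2[3]);   // prints 10 10 10 10
    return 0;
}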


Here is an alternative solution for AVX512 which is similar to the solution for KNC:

#include <x86intrin.h>

int foo(__m512i a) {
    __m512i vsum1 = _mm512_add_epi32(a, _mm512_shuffle_i64x2(a, a, 0xee));               // add upper 256 bits to lower 256 bits
    __m512i vsum2 = _mm512_add_epi32(vsum1, _mm512_shuffle_i64x2(vsum1, vsum1, 0x55));   // add upper 128 bits to lower 128 bits
    __m512i vsum3 = _mm512_add_epi32(vsum2, _mm512_shuffle_epi32(vsum2, _MM_PERM_BADC)); // add upper 64 bits to lower 64 bits
    __m512i vsum4 = _mm512_add_epi32(vsum3, _mm512_shuffle_epi32(vsum3, _MM_PERM_CDAB)); // add upper 32 bits to lower 32 bits
    return _mm_cvtsi128_si32(_mm512_castsi512_si128(vsum4));
}

With gcc -O3 -mavx512f this gives essentially:

vshufi64x2      $238, %zmm0, %zmm0, %zmm1
vpaddd          %zmm1, %zmm0, %zmm0
vshufi64x2      $85, %zmm0, %zmm0, %zmm1
vpaddd          %zmm1, %zmm0, %zmm0
vpshufd         $78, %zmm0, %zmm1
vpaddd          %zmm1, %zmm0, %zmm0
vpshufd         $177, %zmm0, %zmm1
vpaddd          %zmm1, %zmm0, %zmm0
vmovd           %xmm0, %eax
ret

AVX512 uses vshufi64x2 instead of vpermf32x4, and KNC folds the within-lane permute into the add using the {abcd} swizzle notation (e.g. vpaddd %zmm4{badc}, %zmm4, %zmm5). This is roughly what _mm256_hadd_epi32 achieves in AVX2.


I forgot that I had already seen this question for AVX512. Here is another solution.


For what it's worth, here are (untested) intrinsics for KNC:

#include <immintrin.h>
#include <stdint.h>

int foo(__m512i a) {
    __m512i vsum1 = _mm512_add_epi32(a, _mm512_permute4f128_epi32(a, 0xee));                 // add upper 256 bits to lower 256 bits
    __m512i vsum2 = _mm512_add_epi32(vsum1, _mm512_permute4f128_epi32(vsum1, 0x55));         // add upper 128 bits to lower 128 bits
    __m512i vsum3 = _mm512_add_epi32(vsum2, _mm512_swizzle_epi32(vsum2, _MM_SWIZ_REG_BADC)); // add upper 64 bits to lower 64 bits
    __m512i vsum4 = _mm512_add_epi32(vsum3, _mm512_swizzle_epi32(vsum3, _MM_SWIZ_REG_CDAB)); // add upper 32 bits to lower 32 bits
    int32_t out[16] __attribute__((aligned(64)));  // packstorelo may write up to 64 bytes
    _mm512_packstorelo_epi32(out, vsum4);
    return out[0];
}

I don't see a difference in functionality between KNC's _mm512_permute4f128_epi32(a,imm8) and AVX512's _mm512_shuffle_i32x4(a,a,imm8).

The main difference in this case is that _mm512_shuffle_epi32 generates a separate vpshufd instruction, whereas with _mm512_swizzle_epi32 the permutation is folded into the vpaddd as a source modifier, so no extra instruction is needed. That appears to be an advantage of KNC over AVX512.

Z boson