
I know it is possible to do multiply-and-add with a single instruction in AVX2. I want to use a multiply-and-add instruction where each 256-bit AVX2 variable is packed with sixteen 16-bit variables. For instance, consider the example below:

res = a0*b0 + a1*b1 + a2*b2 + a3*b3

Here, each of res, a0, a1, a2, a3, b0, b1, b2, b3 is a 16-bit variable. I have closely followed the discussion. Please find my code below, which calculates the example shown above:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>     // rand(), srand()
#include <immintrin.h>
#include <time.h>
#include "cpucycles.c"

#pragma STDC FP_CONTRACT ON

#define AVX_LEN 16

static inline __m256i mul_add(__m256i a, __m256i b, __m256i c) {
    return _mm256_add_epi16(_mm256_mullo_epi16(a, b), c);
}

void fill_random(int16_t *a, int32_t len){  //to fill up the random array

    int32_t i;

    for(i=0;i<len;i++){     
        a[i]=(int16_t)rand()&0xffff;
    }
}

int main(void){


    int16_t a0[16*AVX_LEN], b0[16*AVX_LEN];
    int16_t a1[16*AVX_LEN], b1[16*AVX_LEN];
    int16_t a2[16*AVX_LEN], b2[16*AVX_LEN];
    int16_t a3[16*AVX_LEN], b3[16*AVX_LEN];
    int16_t res[16*AVX_LEN];


    __m256i a0_avx[AVX_LEN], b0_avx[AVX_LEN];
    __m256i a1_avx[AVX_LEN], b1_avx[AVX_LEN];
    __m256i a2_avx[AVX_LEN], b2_avx[AVX_LEN];
    __m256i a3_avx[AVX_LEN], b3_avx[AVX_LEN];

    __m256i res_avx[AVX_LEN];

    int16_t res_avx_check[16*AVX_LEN];
    int32_t i,j;

    uint64_t mask_ar[4]; //for unloading AVX variables
    mask_ar[0]=~(0UL);mask_ar[1]=~(0UL);mask_ar[2]=~(0UL);mask_ar[3]=~(0UL);
    __m256i mask;
    mask = _mm256_loadu_si256 ((__m256i const *)mask_ar);

    time_t t;
    srand((unsigned) time(&t));


    int32_t repeat=100000;

    uint64_t clock1, clock2, fma_clock;

    clock1=clock2=fma_clock=0;

    for(j=0;j<repeat;j++){
        printf("j : %d\n",j);

        fill_random(a0,16*AVX_LEN);  // Generate random data
        fill_random(a1,16*AVX_LEN);
        fill_random(a2,16*AVX_LEN);
        fill_random(a3,16*AVX_LEN);

        fill_random(b0,16*AVX_LEN);
        fill_random(b1,16*AVX_LEN);
        fill_random(b2,16*AVX_LEN);
        fill_random(b3,16*AVX_LEN);


        for(i=0;i<AVX_LEN;i++){ //Load values in AVX variables

            a0_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a0[i*16]));
            a1_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a1[i*16]));
            a2_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a2[i*16]));
            a3_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a3[i*16]));

            b0_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b0[i*16]));
            b1_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b1[i*16]));
            b2_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b2[i*16]));
            b3_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b3[i*16]));
        }

        for(i=0;i<AVX_LEN;i++){
            res_avx[i]= _mm256_set_epi64x(0, 0, 0, 0);
        }

        //to calculate a0*b0 + a1*b1 + a2*b2 + a3*b3

        //----standard calculation----
        for(i=0;i<16*AVX_LEN;i++){
            res[i]=a0[i]*b0[i] + a1[i]*b1[i] + a2[i]*b2[i] + a3[i]*b3[i];
        }


        //-----AVX-----

        clock1=cpucycles();


        for(i=0;i<AVX_LEN;i++){ //simple approach

            a0_avx[i]=_mm256_mullo_epi16(a0_avx[i], b0_avx[i]);
            res_avx[i]=_mm256_add_epi16(a0_avx[i], res_avx[i]);

            a1_avx[i]=_mm256_mullo_epi16(a1_avx[i], b1_avx[i]);
            res_avx[i]=_mm256_add_epi16(a1_avx[i], res_avx[i]);

            a2_avx[i]=_mm256_mullo_epi16(a2_avx[i], b2_avx[i]);
            res_avx[i]=_mm256_add_epi16(a2_avx[i], res_avx[i]);

            a3_avx[i]=_mm256_mullo_epi16(a3_avx[i], b3_avx[i]);
            res_avx[i]=_mm256_add_epi16(a3_avx[i], res_avx[i]);

        }

        /*
        for(i=0;i<AVX_LEN;i++){ //FMA approach

            res_avx[i]=mul_add(a0_avx[i], b0_avx[i], res_avx[i]);

            res_avx[i]=mul_add(a1_avx[i], b1_avx[i], res_avx[i]);
            res_avx[i]=mul_add(a2_avx[i], b2_avx[i], res_avx[i]);

            res_avx[i]=mul_add(a3_avx[i], b3_avx[i], res_avx[i]);

        }
        */

        clock2=cpucycles();
        fma_clock = fma_clock + (clock2-clock1);

        //-----Check----

        for(i=0;i<AVX_LEN;i++){ //store avx results for comparison
            _mm256_maskstore_epi64 ((long long *)(res_avx_check + i*16), mask, res_avx[i]);
        }

        for(i=0;i<16*AVX_LEN;i++){
            if(res[i]!=res_avx_check[i]){

                printf("\n--ERROR--\n");
                return 1;
            }   

        }
    }


    printf("Total time taken is :%llu\n", fma_clock/repeat);

}

The cpucycles code is from ECRYPT and is given below:

#include "cpucycles.h"

long long cpucycles(void)
{
  unsigned long long result;
  asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
    : "=a" (result) ::  "%rdx");
  return result;
}

My gcc --version returns:

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)

I am using

 Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

When I run this on my computer, I get the following cycle counts for the FMA approach and the simple approach, respectively:

FMA approach : Total time taken is :109
Simple approach : Total time taken is :141

As you can see, the FMA approach is slightly faster, but I expected it to be even faster. I understand that my sample code has many memory accesses, which might be the reason for the degraded performance. But:

  1. When I dump the assembly, I see almost identical instructions for both approaches, and I do not see any FMA instructions in the FMA version. I don't understand why. Is it because of the _mm256_mullo_epi16 instructions?

  2. Is my approach correct?

  3. Can you please help me to fix this?

I am new to AVX2 programming, so it is quite possible that I have done something non-standard, but I will be glad to clarify anything that is unclear. I thank you all for your help in advance.

Rick
    x86 doesn't have integer FMA / MAC (multiply-accumulate) other than horizontal `pmaddubsw` / `pmaddwd`. FP-contract options have nothing to do with integer stuff; integer math is always exact. – Peter Cordes Jul 31 '19 at 09:20
  • @PeterCordes I don't understand. I am talking about vector instructions. – Rick Jul 31 '19 at 09:25
  • `[v]pmaddubsw` and `[v]pmaddwd` are SIMD SSE / AVX2 instructions. I meant SIMD-integer, of course, not scalar in GP registers. – Peter Cordes Jul 31 '19 at 09:32
  • @PeterCordes Thank you, I understand now. What I still don't understand is why the clock cycles decrease when I use the FMA approach. – Rick Jul 31 '19 at 10:16
  • What compiler options did you use? Did you forget to enable optimization? `gcc -O3 -march=native`. Actually your compiler is way too old to know `-march=skylake` so `-march=native` won't properly set `-mtune` settings. But anyway, there shouldn't be a difference after optimization. – Peter Cordes Jul 31 '19 at 19:33
  • @PeterCordes I used -mavx2 -O3 -mfma. After your comments I also added -march=native in gcc-6.5. And as you said it does not change anything. – Rick Aug 01 '19 at 10:48
  • Oh right, the compiler knows your arrays are aligned so the default tuning of splitting unaligned 256-bit loads/stores wouldn't be a problem here. – Peter Cordes Aug 01 '19 at 11:13

1 Answer


x86 doesn't have SIMD-integer FMA / MAC (multiply-accumulate) other than horizontal pmaddubsw / pmaddwd, which add horizontal pairs into wider integers. (Until AVX512IFMA _mm_madd52lo_epu64 or AVX512_4VNNIW _mm512_4dpwssd_epi32(__m512i, __m512ix4, __m128i *).)
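
For reference, a minimal sketch of the one widening 16-bit multiply-accumulate that does exist (the helper name madd_accumulate is just for illustration). Note that vpmaddwd adds adjacent product pairs into 32-bit lanes, so the accumulator is 32-bit, unlike the 16-bit lane-wise result in the question:

static inline __m256i madd_accumulate(__m256i a, __m256i b, __m256i acc32) {
    // vpmaddwd: multiply sixteen signed 16-bit lanes, add adjacent product
    // pairs to give eight 32-bit sums, then accumulate into acc32
    return _mm256_add_epi32(_mm256_madd_epi16(a, b), acc32);
}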

FP-contract and -ffast-math options have nothing to do with SIMD-integer stuff; integer math is always exact.
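
For contrast, the FMA instructions that -mfma and FP_CONTRACT actually refer to are floating-point only; a minimal example (the helper name fma_ps is hypothetical):

__m256 fma_ps(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);   // a*b + c in one instruction, with a single rounding
}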


I think your "simple" approach is slower because you're also modifying the input arrays and this doesn't get optimized away, e.g.

a0_avx[i] = _mm256_mullo_epi16(a0_avx[i], b0_avx[i]);

as well as updating res_avx[i].

If the compiler doesn't optimize that away, those extra stores might be exactly why it's slower than your mul_add function. rdtsc without a serializing instruction doesn't even have to wait for earlier instructions to execute, let alone retire or commit stores to L1d cache, but extra uops for the front-end are still more to chew through. At only 1 store per clock throughput, that could easily be a new bottleneck.
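
For example, an untested sketch of the "simple" loop that keeps the products in locals and does only one store per result vector:

for(i=0;i<AVX_LEN;i++){
    // products stay in registers; the a*_avx input arrays are left untouched
    __m256i sum = _mm256_mullo_epi16(a0_avx[i], b0_avx[i]);
    sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(a1_avx[i], b1_avx[i]));
    sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(a2_avx[i], b2_avx[i]));
    sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(a3_avx[i], b3_avx[i]));
    res_avx[i] = sum;   // single store of the final result per iteration
}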


FYI, you don't need to copy your inputs to arrays of __m256i. Normally you'd just use SIMD loads on regular data. That's no slower than indexing arrays of __m256i. Your arrays are too big for the compiler to fully unroll and keep everything in registers (like it would for scalar __m256i variables).

If you'd just used __m256i a0 = _mm256_loadu_si256(...) inside the loop then you could have updated a0 without slowing your code down because it would just be a single local variable that can be kept in a YMM reg.

But I find it's good style to use new named tmp vars for most steps to make code more self-documenting. Like __m256i ab = ... or sum = .... You could reuse the same sum temporary for each a0+b0 and a1+b1.

You might also use a temporary for the result vector instead of making the compiler optimize away the updates of memory at res_avx[i] until the final one.
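
Putting those points together, an untested sketch of the computation that loads straight from the plain int16_t arrays and stores straight to res, with no __m256i arrays at all:

for(i=0;i<AVX_LEN;i++){
    // everything stays in YMM registers until the final store
    __m256i sum = _mm256_mullo_epi16(
        _mm256_loadu_si256((__m256i const *)&a0[i*16]),
        _mm256_loadu_si256((__m256i const *)&b0[i*16]));
    sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(
        _mm256_loadu_si256((__m256i const *)&a1[i*16]),
        _mm256_loadu_si256((__m256i const *)&b1[i*16])));
    sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(
        _mm256_loadu_si256((__m256i const *)&a2[i*16]),
        _mm256_loadu_si256((__m256i const *)&b2[i*16])));
    sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(
        _mm256_loadu_si256((__m256i const *)&a3[i*16]),
        _mm256_loadu_si256((__m256i const *)&b3[i*16])));
    _mm256_storeu_si256((__m256i *)&res[i*16], sum);   // plain store; no maskstore needed
}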

You can use alignas(32) int16_t a0[...]; to make plain arrays aligned for _mm256_load instead of loadu.
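
For example, assuming a C11 compiler (alignas comes from <stdalign.h> in C):

#include <stdalign.h>

alignas(32) int16_t a0[16*AVX_LEN];   // 32-byte aligned plain array
__m256i v = _mm256_load_si256((__m256i const *)&a0[i*16]);   // aligned load is now safe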


Your cpucycles() RDTSC function doesn't need to use inline asm. Use __rdtsc() instead.
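
A drop-in replacement for the ECRYPT routine could be (GCC and Clang declare __rdtsc() in x86intrin.h):

#include <x86intrin.h>

long long cpucycles(void)
{
    return __rdtsc();   // the compiler emits rdtsc plus the shift/or for you
}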

Peter Cordes
  • Thank you for the detailed answer. You have written "you're also modifying the input arrays and this doesn't get optimized away". Can you please tell me the best way to update an AVX2 array like this? I appreciate your help. – Rick Aug 03 '19 at 11:32
  • @Rick: Generally don't use arrays of `__m256d` in the first place, just use a few __m256d locals so they can be kept in registers. And when you do have arrays (whether they hold `double` or `__m256d`), use a tmp var instead of modifying them unless you actually *want* that side effect. – Peter Cordes Aug 03 '19 at 11:40