
I'm trying to learn about intrinsics and how to properly use and optimize them, so I decided to implement a function that computes the dot product of two arrays as a starting point.

I wrote two functions that compute the dot product of two int arrays. One is coded the normal way: loop through every element of the two arrays, multiply the corresponding elements, and add/accumulate the resulting products to get the dot product.

The other uses intrinsics: I operate on four elements of each array at a time, multiply them with _mm_mullo_epi32, then use two horizontal adds (_mm_hadd_epi32) to get the sum of the current four products, add that to dot_product, and move on to the next four elements. This repeats until I reach the calculated limit vec_loop; the remaining elements are handled the normal way to avoid reading past the end of the arrays. Then I compare the performance of the two.
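For reference, here is how the two _mm_hadd_epi32 calls collapse the four products into a single sum (this follows directly from the intrinsic's lane-wise behaviour; p0..p3 stand for the four products):

__m128i va = _mm_set_epi32(p3, p2, p1, p0); // va = [p0, p1, p2, p3]
va = _mm_hadd_epi32(va, va);                // [p0+p1, p2+p3, p0+p1, p2+p3]
va = _mm_hadd_epi32(va, va);                // [p0+p1+p2+p3, repeated]
int sum = _mm_cvtsi128_si32(va);            // lowest element = p0+p1+p2+p3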

Header file with the two dot product functions:

// main.hpp
#ifndef main_hpp
#define main_hpp

#include <iostream>
#include <immintrin.h>

template<typename T>
T scalar_dot(T* a, T* b, size_t len){
    T dot_product = 0;
    for(size_t i=0; i<len; ++i) dot_product += a[i]*b[i];
    return dot_product;
}

int sse_int_dot(int* a, int* b, size_t len){
    
    size_t vec_loop = len/4;
    size_t non_vec = len%4;
    size_t start_non_vec_i = len-non_vec;

    int dot_prod = 0;

    for(size_t i=0; i<vec_loop; ++i)
    {
        __m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
        __m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
        va = _mm_mullo_epi32(va,vb);
        va = _mm_hadd_epi32(va,va);
        va = _mm_hadd_epi32(va,va);
        dot_prod += _mm_cvtsi128_si32(va);
    }

    for(size_t i=start_non_vec_i; i<len; ++i) dot_prod += a[i]*b[i];

    return dot_prod;
}

#endif

The .cpp code that measures the time taken by each function:

// main.cpp
#include <iostream>
#include <chrono>
#include <random>
#include "main.hpp"

int main()
{
    // generate random integers
    unsigned seed = std::chrono::steady_clock::now().time_since_epoch().count();
    std::mt19937_64 rand_engine(seed);
    std::mt19937_64 rand_engine2(seed/2);
    std::uniform_int_distribution<int> random_number(0,9);

    size_t LEN = 10000000;

    int* a = new int[LEN];
    int* b = new int[LEN];

    for(size_t i=0; i<LEN; ++i)
    {
        a[i] = random_number(rand_engine);
        b[i] = random_number(rand_engine2);
    }

    #ifdef SCALAR
    int dot1 = 0;
    #endif
    #ifdef VECTOR
    int dot2 = 0;
    #endif

    // timing
    auto start = std::chrono::high_resolution_clock::now();
    #ifdef SCALAR
    dot1 = scalar_dot(a,b,LEN);
    #endif
    #ifdef VECTOR
    dot2 = sse_int_dot(a,b,LEN);
    #endif
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
    std::cout<<"proccess taken "<<duration.count()<<" nanoseconds\n";

    #ifdef SCALAR
    std::cout<<"\nScalar : Dot product = "<<dot1<<"\n";
    #endif
    #ifdef VECTOR
    std::cout<<"\nVector : Dot product = "<<dot2<<"\n";
    #endif

    return 0;
}

compilation:

  • intrinsic version : g++ main.cpp -DVECTOR -msse4.1 -o main.o
  • normal version : g++ main.cpp -DSCALAR -msse4.1 -o main.o

my machine:

  • Architecture: x86_64
  • CPU(s) : 1
  • CPU core(s): 4
  • Thread(s) per core: 1
  • Model name: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
  • L1d cache: 96 KiB
  • L1i cache: 128 KiB
  • L2 cache: 2 MiB
  • some Flags : sse, sse2, sse4_1, sse4_2

In main.cpp the int arrays have 10000000 elements. When I compile the code above on my machine, the intrinsic function seems to run slower than the normal version most of the time: the intrinsic version takes around 97529675 nanoseconds, and sometimes even longer, while the normal code only takes around 87568313 nanoseconds. I thought my intrinsic function should run faster with the optimization flags off, but it turns out to be a little bit slower.

so my questions are:

  • why does my intrinsic function run slower? (am I doing something wrong?)
  • how can I correct my intrinsic implementation? what is the proper way?
  • does the compiler auto-vectorize/unroll the normal code even when the optimization flags are off?
  • what is the fastest way to get the dot product given the specs of my machine?

I hope someone can help, thanks

0xdeadbeef
  • Since you compile without optimisation I assume the compiler just leaves it all as is. The biggest problem I see at a quick glance is the constant horizontal addition, just add things to a `_mm128i` and only do the horizontal addition at the end of the loop. Out of curiosity, what happens if you enable optimisation? – Qubit Jul 29 '21 at 09:36
  • @Qubit if I enable optimization they both speeds up but still the intrinsic is slower, sometimes it reaches e-1 of the speed of the other – 0xdeadbeef Jul 29 '21 at 09:40
  • @Qubit, "just add things to a _mm128i and only do the horizontal addition at the end of the loop." so I need to allocate a temporary array? because I am not allowed to change the values of the other two arrays – 0xdeadbeef Jul 29 '21 at 09:42
  • Just have `_mm128i _c` set to 0 at the start (before the loop) and sum everything into that. Then at the end add up the elements of `_c` into an integer. I would expect this to perform better on pretty much all CPUs, but I might be wrong, I didn't go looking at the timings. All of this of course assumes that the loop is not 4 elements. – Qubit Jul 29 '21 at 09:44
  • @Qubit "Just have ```_mm128i _c``` set to 0 at the start (before the loop) and sum everything into that. Then at the end add up the elements of ```_c``` into an integer." how would I do that? btw the array could be of any size – 0xdeadbeef Jul 29 '21 at 10:01
  • You do (c1,c2)=(a1,a2)*(b1,b2); s+=c1+c2 in each iteration. The suggestion is to do only (s1,s2)+=(a1,a2)*(b1,b2) in each iteration and then at the end s=s1+s2. – j6t Jul 29 '21 at 10:26
  • Don't horizontal sum inside the inner loop. See [What is the fastest way to compute large dot products?](https://stackoverflow.com/q/17011666) for an FP `double` example. Also, you need to enable optimizations for anything to be worth benchmarking. [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) Anti-optimized debug mode can slow down intrinsics even more than simple code that does more in one statement. – Peter Cordes Jul 29 '21 at 10:57

1 Answer


So with @Peter Cordes's, @Qubit's and @j6t's suggestions, I tweaked the code a little bit: I now only do the multiplication inside the loop and moved the horizontal addition outside of it. This increased the performance of the intrinsic version from around 97529675 nanoseconds to around 56444187 nanoseconds, which is significantly faster than my previous implementation, with the same compilation flags and 10000000-element int arrays.

Here is the new function from main.hpp:

int _sse_int_dot(int* a, int* b, size_t len){
        
    size_t vec_loop = len/4;
    size_t non_vec = len%4;
    size_t start_non_vec_i = len-non_vec;

    int dot_product;
    __m128i vdot_product = _mm_set1_epi32(0);

    for(size_t i=0; i<vec_loop; ++i)
    {
        __m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
        __m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
        __m128i vc = _mm_mullo_epi32(va,vb);
        vdot_product = _mm_add_epi32(vdot_product,vc);
    }

    vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
    vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
    dot_product = _mm_cvtsi128_si32(vdot_product);

    for(size_t i=start_non_vec_i; i<len; ++i) dot_product += a[i]*b[i];

    return dot_product;
}

If there is more to improve in this code, please point it out; for now I'm just going to leave it here as the answer.
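One further idea from the linked dot-product discussion, which I have not benchmarked on this machine: _mm_mullo_epi32 has a latency of several cycles, so unrolling the loop with two independent accumulators gives the CPU two dependency chains to overlap. A sketch only (the horizontal sum and the scalar tail would be handled as above, with the tail starting at i):

// Sketch, not benchmarked: two independent accumulators to hide
// the latency of _mm_mullo_epi32.
__m128i acc0 = _mm_setzero_si128();
__m128i acc1 = _mm_setzero_si128();
size_t i = 0;
for(; i+8 <= len; i+=8)
{
    __m128i va0 = _mm_loadu_si128((__m128i*)(a+i));
    __m128i vb0 = _mm_loadu_si128((__m128i*)(b+i));
    __m128i va1 = _mm_loadu_si128((__m128i*)(a+i+4));
    __m128i vb1 = _mm_loadu_si128((__m128i*)(b+i+4));
    acc0 = _mm_add_epi32(acc0, _mm_mullo_epi32(va0,vb0));
    acc1 = _mm_add_epi32(acc1, _mm_mullo_epi32(va1,vb1));
}
// combine the two partial vectors, then horizontal-sum as before
__m128i acc = _mm_add_epi32(acc0, acc1);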

0xdeadbeef
  • Coding style: don't use `_` names for locals, it's not very distinctive with lots of other underscores in function names. I sometimes use names like `__m128i va` for a vector related to scalar `a`. Also don't use leading underscores in function names, that's reserved for global scope names. – Peter Cordes Jul 30 '21 at 01:06
  • `_mm_hadd_epi32` is not efficient, but it doesn't matter much because it's outside the loop. See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026). But otherwise yes, that's a sensible implementation. Benchmark with optimization enabled to see how it compares to auto-vectorized asm that your compiler generates from plain scalar C. – Peter Cordes Jul 30 '21 at 01:08
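Following up on the last comment, here is a sketch of the shuffle-based horizontal sum from the linked answer. It needs only SSE2 and avoids the relatively slow phaddd instruction that _mm_hadd_epi32 compiles to:

// Sketch adapted from the linked answer: horizontal sum of a __m128i
// using shuffles instead of _mm_hadd_epi32 (phaddd). SSE2 only.
static inline int hsum_epi32(__m128i v)
{
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2));       // [v2, v3, v0, v1]
    __m128i sum64 = _mm_add_epi32(v, hi64);                           // [v0+v2, v1+v3, ...]
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1,0,3,2)); // swap the low two 32-bit lanes
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);                                  // v0+v1+v2+v3
}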