I'm trying to learn about intrinsic and how to properly utilize, and optimize it, I decided to implement a function to get the dot product of two arrays as a starting point to learn.
I create two functions to get the dot product of an array of integers int
, one is coded in a normal way where you loop through every elements of the two arrays then perform multiplication with each element then add/accumulate/sum the resulting products to get the dot product.
The other uses intrinsic in a way where, I perform intrinsic operations on four elements of each array, I multiply each of them using _mm_mullo_epi32
, then uses 2 horizontal add _mm_hadd_epi32
to get the sum of the current 4 elements, after that I add it up to the dot_product, then proceed to the the next four element, then repeat until I get to the calculated limit vec_loop
, then I calculate the other remaining elements using the normal way to avoid calculating out of the array's memory, then I compare the performance of the two.
header file with the two types of dot product function:
// main.hpp
#ifndef main_hpp
#define main_hpp
#include <iostream>
#include <immintrin.h>
template<typename T>
T scalar_dot(T* a, T* b, size_t len){
T dot_product = 0;
for(size_t i=0; i<len; ++i) dot_product += a[i]*b[i];
return dot_product;
}
int sse_int_dot(int* a, int* b, size_t len){
size_t vec_loop = len/4;
size_t non_vec = len%4;
size_t start_non_vec_i = len-non_vec;
int dot_prod = 0;
for(size_t i=0; i<vec_loop; ++i)
{
__m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
__m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
va = _mm_mullo_epi32(va,vb);
va = _mm_hadd_epi32(va,va);
va = _mm_hadd_epi32(va,va);
dot_prod += _mm_cvtsi128_si32(va);
}
for(size_t i=start_non_vec_i; i<len; ++i) dot_prod += a[i]*b[i];
return dot_prod;
}
#endif
cpp code to measure the time taken of each function
// main.cpp
#include <iostream>
#include <chrono>
#include <random>
#include "main.hpp"
int main()
{
// generate random integers
unsigned seed = std::chrono::steady_clock::now().time_since_epoch().count();
std::mt19937_64 rand_engine(seed);
std::mt19937_64 rand_engine2(seed/2);
std::uniform_int_distribution<int> random_number(0,9);
size_t LEN = 10000000;
int* a = new int[LEN];
int* b = new int[LEN];
for(size_t i=0; i<LEN; ++i)
{
a[i] = random_number(rand_engine);
b[i] = random_number(rand_engine2);
}
#ifdef SCALAR
int dot1 = 0;
#endif
#ifdef VECTOR
int dot2 = 0;
#endif
// timing
auto start = std::chrono::high_resolution_clock::now();
#ifdef SCALAR
dot1 = scalar_dot(a,b,LEN);
#endif
#ifdef VECTOR
dot2 = sse_int_dot(a,b,LEN);
#endif
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
std::cout<<"proccess taken "<<duration.count()<<" nanoseconds\n";
#ifdef SCALAR
std::cout<<"\nScalar : Dot product = "<<dot1<<"\n";
#endif
#ifdef VECTOR
std::cout<<"\nVector : Dot product = "<<dot2<<"\n";
#endif
return 0;
}
compilation:
- intrinsic version :
g++ main.cpp -DVECTOR -msse4.1 -o main.o
- normal version :
g++ main.cpp -DSCALAR -msse4.1 -o main.o
my machine:
- Architecture: x86_64
- CPU(s) : 1
- CPU core(s): 4
- Thread(s) per core: 1
- Model name: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
- L1d cache: 96 KiB
- L1i cache: 128 KiB
- L2 cache: 2 MiB
- some Flags : sse, sse2, sse4_1, sse4_2
In the main.cpp
there are 10000000 elements of int
array, when I compile the code above in my machine, it seems that the intrinsic function runs slower than the normal version, most of the time, intrinsic take around97529675 nanoseconds
and sometimes even longer, while the normal code only takes around 87568313 nanoseconds
, here I thought that my intrinsic function should run faster if the optimization flags is off, but turns out it is indeed somehow a little bit slower.
so my questions are:
- why is my intrinsic function runs slower? (am I doing something wrong?)
- how can I correct my intrinsic implementation, what is the proper way?
- does the compiler auto vectorize/unroll the normal code even when the optimization flag is off
- what is the fastest way to get the dot product given the specs of my machine?
I hope someone can help, thanks