I have implemented scalar matrix addition kernel.
#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>
//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000
float __attribute__(( aligned(32))) A[N][M],
__attribute__(( aligned(32))) B[N][M],
__attribute__(( aligned(32))) C[N][M];
int main()
{
int w=0, i, j;
struct timespec tStart, tEnd;//used to record the processiing time
double tTotal , tBest=10000;//minimum of toltal time will asign to the best time
do{
clock_gettime(CLOCK_MONOTONIC,&tStart);
for( i=0;i<N;i++){
for(j=0;j<M;j++){
C[i][j]= A[i][j] + B[i][j];
}
}
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
} while(w++ < NUM_LOOP);
printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);
return 0;
}
In this case, I've compiled the program with different compiler flag and the assembly output of the inner loop is as follows:
gcc -O2 msse4.2
: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix
movss xmm1, DWORD PTR A[rcx+rax]
addss xmm1, DWORD PTR B[rcx+rax]
movss DWORD PTR C[rcx+rax], xmm1
gcc -O2 -mavx
: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix
vmovss xmm1, DWORD PTR A[rcx+rax]
vaddss xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss DWORD PTR C[rcx+rax], xmm1
AVX version gcc -O2 -mavx
:
__m256 vec256;
for(i=0;i<N;i++){
for(j=0;j<M;j+=8){
vec256 = _mm256_add_ps( _mm256_load_ps(&A[i+1][j]) , _mm256_load_ps(&B[i+1][j]));
_mm256_store_ps(&C[i+1][j], vec256);
}
}
SSE version gcc -O2 -sse4.2
::
__m128 vec128;
for(i=0;i<N;i++){
for(j=0;j<M;j+=4){
vec128= _mm_add_ps( _mm_load_ps(&A[i][j]) , _mm_load_ps(&B[i][j]));
_mm_store_ps(&C[i][j], vec128);
}
}
In scalar program the speedup of -mavx
over msse4.2
is 2.7x. I know the avx
improved the ISA efficiently and it might be because of these improvements. But when I implemented the program in intrinsics for both AVX
and SSE
the speedup is a factor of 3x. The question is: AVX scalar is 2.7x faster than SSE when I vectorized it the speed up is 3x (matrix size is 128x128 for this question). Does it make any sense While using AVX and SSE in scalar mode yield, a 2.7x speedup. but vectorized method must be better because I process eight elements in AVX compared to four elements in SSE. All programs have less than 4.5% of cache misses as perf stat
reported.
using gcc -O2
, linux mint
, skylake
UPDATE: Briefly, Scalar-AVX is 2.7x faster than Scalar-SSE but AVX-256 is only 3x faster than SSE-128 while it's vectorized. I think it might be because of pipelining. in scalar I have 3 vec-ALU
that might not be useable in vectorized mode. I might compare apples to oranges instead of apples to apples and this might be the point that I can not understand the reason.