I am trying to understand the benefit of using SIMD vectorization and wrote a simple demonstrator code to see what would be the speed gain of an algorithm leveraging vectorization (SIMD) over another. Here are the 2 algorithms:
Alg_A - No Vector support:
#include <stdio.h>
#define SIZE 1000000000
int main() {
printf("Algorithm with NO vector support\n");
int a[] = {1, 2, 3, 4};
int b[] = {5, 6, 7, 8};
int i = 0;
printf("Running loop %d times\n", SIZE);
for (; i < SIZE; i++) {
a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
a[3] = a[3] + b[3];
}
printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
}
Alg_B - With Vector support:
#include <stdio.h>
#define SIZE 1000000000
typedef int v4_i __attribute__ ((vector_size(16)));
union Vec4 {
v4_i v;
int i[4];
};
int main() {
printf("Algorithm with vector support\n\n");
union Vec4 a, b;
a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
int i = 0;
printf("Running loop %d times\n", SIZE);
for (; i < SIZE; i++) {
a.v = a.v + b.v;
}
printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);
}
The compilation was done as follows:
Alg_A :
gcc -ggdb -mno-sse -mno-sse2 -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -c non_vector_support.c
gcc non_vector_support.o -o non_vector_support
Alg_B :
gcc -ggdb -c vector_support.c
gcc vector_support.o -o vector_support
Looking at the generated code for both algorithms, I can see that the compilation did not do any tricks like 'auto-vectorization' (e.g. could not see xmm
registers):
Alg_A :
for (; i < SIZE; i++) {
74: eb 30 jmp a6 <main+0xa6>
a[0] = a[0] + b[0];
76: 8b 55 d8 mov -0x28(%rbp),%edx
79: 8b 45 e8 mov -0x18(%rbp),%eax
7c: 01 d0 add %edx,%eax
7e: 89 45 d8 mov %eax,-0x28(%rbp)
a[1] = a[1] + b[1];
81: 8b 55 dc mov -0x24(%rbp),%edx
84: 8b 45 ec mov -0x14(%rbp),%eax
87: 01 d0 add %edx,%eax
89: 89 45 dc mov %eax,-0x24(%rbp)
a[2] = a[2] + b[2];
8c: 8b 55 e0 mov -0x20(%rbp),%edx
8f: 8b 45 f0 mov -0x10(%rbp),%eax
92: 01 d0 add %edx,%eax
94: 89 45 e0 mov %eax,-0x20(%rbp)
a[3] = a[3] + b[3];
97: 8b 55 e4 mov -0x1c(%rbp),%edx
9a: 8b 45 f4 mov -0xc(%rbp),%eax
9d: 01 d0 add %edx,%eax
9f: 89 45 e4 mov %eax,-0x1c(%rbp)
int a[] = {1, 2, 3, 4};
int b[] = {5, 6, 7, 8};
int i = 0;
printf("Running loop %d times\n", SIZE);
for (; i < SIZE; i++) {
a2: 83 45 d4 01 addl $0x1,-0x2c(%rbp)
a6: 81 7d d4 ff c9 9a 3b cmpl $0x3b9ac9ff,-0x2c(%rbp)
ad: 7e c7 jle 76 <main+0x76>
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
a[3] = a[3] + b[3];
}
printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
Alg_B :
for (; i < SIZE; i++) {
74: eb 16 jmp 8c <main+0x8c>
a.v = a.v + b.v;
76: 66 0f 6f 4d d0 movdqa -0x30(%rbp),%xmm1
7b: 66 0f 6f 45 e0 movdqa -0x20(%rbp),%xmm0
80: 66 0f fe c1 paddd %xmm1,%xmm0
84: 0f 29 45 d0 movaps %xmm0,-0x30(%rbp)
union Vec4 a, b;
a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
int i = 0;
printf("Running loop %d times\n", SIZE);
for (; i < SIZE; i++) {
88: 83 45 cc 01 addl $0x1,-0x34(%rbp)
8c: 81 7d cc ff c9 9a 3b cmpl $0x3b9ac9ff,-0x34(%rbp)
93: 7e e1 jle 76 <main+0x76>
a.v = a.v + b.v;
}
printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);
And when I run the programs, I was hoping to see an improvement in speed by a factor of 4 however, the gain appears to be on average =~ 1s for this size of data and if increased the SIZE to around 8000000000 the gain is =~ 5s. This is the timing for the value in the above code:
Alg_A :
Algorithm with NO vector support
Running loop 1000000000 times
A: [705032705 1705032706 -1589934589 -589934588]
real 0m3.279s
user 0m3.282s
sys 0m0.000s
Alg_B :
Algorithm with vector support
Running loop 1000000000 times
A: [705032705 705032706 -1589934589 -589934588]
real 0m2.609s
user 0m2.607s
sys 0m0.004s
Counting the overhead associated to the loop. I ran the an empty loop for the given SIZE and obtained =~ 2.2s on avg. Which gives me an average speed up of around 2.5 times.
Have i missed something in the code or compilation lines? Or, is this suppose to be correct and in which case would someone know why isn't there a gain in 4 times in speed if I am exploiting 4 data points and 1 instruction at each iteration?
Thanks in advance.