
I have implemented a scalar matrix-addition kernel:

#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>

//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000


float   __attribute__(( aligned(32))) A[N][M],
        __attribute__(( aligned(32))) B[N][M],
        __attribute__(( aligned(32))) C[N][M];

int main()
{
    int w = 0, i, j;
    struct timespec tStart, tEnd;   // used to record the processing time
    double tTotal, tBest = 10000;   // the minimum of the total times is kept as the best time

    do {
        clock_gettime(CLOCK_MONOTONIC, &tStart);

        for (i = 0; i < N; i++) {
            for (j = 0; j < M; j++) {
                C[i][j] = A[i][j] + B[i][j];
            }
        }

        clock_gettime(CLOCK_MONOTONIC, &tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
        if (tTotal < tBest)
            tBest = tTotal;
    } while (w++ < NUM_LOOP);

    printf("The best time: %lf sec in %d repetitions for %dx%d matrix\n", tBest, w, N, M);
    return 0;
}

In this case, I've compiled the program with different compiler flags, and the assembly output of the inner loop is as follows:

gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetitions for 128x128 matrix

movss   xmm1, DWORD PTR A[rcx+rax]
addss   xmm1, DWORD PTR B[rcx+rax]
movss   DWORD PTR C[rcx+rax], xmm1

gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetitions for 128x128 matrix

vmovss  xmm1, DWORD PTR A[rcx+rax]
vaddss  xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss  DWORD PTR C[rcx+rax], xmm1

AVX version gcc -O2 -mavx:

__m256 vec256;
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j += 8) {
        vec256 = _mm256_add_ps(_mm256_load_ps(&A[i][j]), _mm256_load_ps(&B[i][j]));
        _mm256_store_ps(&C[i][j], vec256);
    }
}

SSE version, gcc -O2 -msse4.2:

__m128 vec128;
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j += 4) {
        vec128 = _mm_add_ps(_mm_load_ps(&A[i][j]), _mm_load_ps(&B[i][j]));
        _mm_store_ps(&C[i][j], vec128);
    }
}

In the scalar program, the speedup of -mavx over -msse4.2 is 2.7x. I know AVX improved the ISA, and the gain might come from those improvements. But when I implemented the program in intrinsics for both AVX and SSE, the speedup is a factor of 3x. The question is: scalar AVX is 2.7x faster than scalar SSE, yet when I vectorize, the speedup is only 3x (the matrix size is 128x128 for this question). Does that make sense? If using AVX versus SSE in scalar mode already yields 2.7x, the vectorized comparison should show a much bigger gap, because I process eight elements per instruction with AVX compared to four with SSE. All programs have less than 4.5% cache misses, as perf stat reported.
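A back-of-the-envelope expectation (my own rough count, ignoring memory bandwidth and loop overhead):

scalar:     addss vs. vaddss  ->  1 float per instruction each    ->  expected ~1x
vectorized: addps vs. vaddps  ->  4 vs. 8 floats per instruction  ->  expected ~2x
measured:   2.7x (scalar) and 3.0x (vectorized)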

Using gcc -O2, Linux Mint, Skylake.

UPDATE: Briefly, scalar AVX is 2.7x faster than scalar SSE, but vectorized AVX-256 is only 3x faster than vectorized SSE-128. I think it might be because of pipelining: in scalar mode I have three vector ALUs that might not be usable in vectorized mode. I might be comparing apples to oranges instead of apples to apples, and this might be the point that I cannot understand.

  • To answer the title question (I can't fully parse the last part of the body): [GCC does what you said only when compiling at -O1](https://godbolt.org/g/T4xnCU). When targeting systems with AVX, it is [always a good idea to use the VEX versions of the legacy SSE instructions](https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf). – Margaret Bloom Feb 19 '17 at 09:15
  • @MargaretBloom, no, `gcc -O2`; I added it to the question. Targeting is OK, but I'm comparing pure `AVX` and `SSE`, not AVX-256 with AVX-128. – Amiri Feb 19 '17 at 09:51
  • @MargaretBloom, vectorization is enabled by `-ftree-loop-vectorize`, which is enabled by `-O3` but not `-O2`. It will even vectorize with `-O1 -ftree-loop-vectorize`. – Z boson Feb 19 '17 at 10:00
  • `vaddps x,x,mem` does not require `mem` to be aligned, whereas `addps x,mem` does (`x` is a SIMD register: xmm, ymm, or zmm). That's one advantage of VEX encoding. – Z boson Feb 19 '17 at 10:04
  • @Zboson, yes, you are right, but the memory is aligned: a 16-byte boundary for SSE and a 32-byte boundary for AVX. – Amiri Feb 19 '17 at 10:08
  • @Zboson Honestly I can't understand the point. I'd have expected the OP to use `-O3` when profiling (no need to nitpick on the, admittedly wrong, "only" part of my observation, IMO). To OP: I don't understand you either: "*The speed up of -mavx over msse4.2 is 2.7x.*" seems to be contradicted by "*Scalar-SSE is 2.7x faster than Scalar-AVX*", further you *explicitly* asked GCC to generate VEX versions (`GCC depresses SSEx instructions when -mavx is used. It generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.`) and I gave you a link explaining the rationale. – Margaret Bloom Feb 19 '17 at 13:14
  • @MargaretBloom, I agree, I don't get the point. The OP's claims are confusing and the update seems contradictory. I don't see any good reason in this case why the scalar SSE or AVX code would make a significant difference. I can't reproduce the OP's results so far with GCC 6.2, Ubuntu 16.10, Skylake. I thought maybe the OP was [seeing this](http://stackoverflow.com/q/41303780/2542702). – Z boson Feb 19 '17 at 13:36
  • @Zboson, "The speed up of `-mavx` over `-msse4.2` is 2.7x." it's for scalar program wich is compiled with these flags. I made a mistake in UPDATE part just edited it – Amiri Feb 19 '17 at 13:43
  • It makes no sense: `-O2 -msse4.2` should have about the same speed as `-O2 -mavx`. In fact, so far the first case is about 10% faster. I have no idea why the AVX version would be 2.7x faster. I don't observe this. – Z boson Feb 19 '17 at 13:47
  • Can you add `__asm__ __volatile__ ( "vzeroupper" : : : );` right after main and test again? – Z boson Feb 19 '17 at 13:53
  • gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetitions for 128x128 matrix and gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetitions for 128x128 matrix were added to the question body – Amiri Feb 19 '17 at 13:54
  • @Zboson, I just added the `asm` and tested; the results did not change. – Amiri Feb 19 '17 at 13:57
  • I am nearly out of ideas. The only thing left I can think of is if `clock_gettime` was leaving the upper half of an AVX register dirty. You could try `__asm__ __volatile__ ( "vzeroupper" : : : );` after each call to `clock_gettime()`. I doubt this is the problem though. – Z boson Feb 19 '17 at 14:03
  • @Zboson, yes that was the point. I tested again and you are completely right. thanks. `-msse4.2` got 8 ns and `-mavx` got 9 ns. – Amiri Feb 19 '17 at 14:09
  • You mean that solved the problem? What version of Linux Mint are you running? What kernel? What version of glibc? – Z boson Feb 19 '17 at 14:10
  • @Zboson, yes, you can write the answer and I will accept it. The problem was not about AVX and SSE. `Linux mint 18`, kernel `4.4.0-53` – Amiri Feb 19 '17 at 14:12
  • What does `ldd --version` report? – Z boson Feb 19 '17 at 14:13
  • ldd (Ubuntu GLIBC 2.23-0ubuntu5) 2.23 – Amiri Feb 19 '17 at 14:14
  • Should I use `__asm__ __volatile__ ( "vzeroupper" : : : );` after each `clock_gettime`? This point will kill me. I have 200 variables in my research paper, and if they change, I want to die. – Amiri Feb 19 '17 at 14:16
  • I don't know the best solution. What about upgrading your Linux version? I don't have the problem on my Skylake system with Ubuntu 16.10 and glibc 2.24. – Z boson Feb 19 '17 at 14:19
  • I don't know. I can't right now; I have to publish the paper. How should I address this problem?! – Amiri Feb 19 '17 at 14:20
  • Could I upgrade from the update manager without all my applications and libraries changing? – Amiri Feb 19 '17 at 14:22
  • What version of Linux Mint are you running? Maybe you can update that? – Z boson Feb 19 '17 at 14:27
  • I use Linux Mint 18 Sarah. I think I can, but the problem is I will lose all my results, and tomorrow is my paper's deadline. – Amiri Feb 19 '17 at 14:28
  • BTW, the latest version of mint is 18.1. I have no idea if 18.1 will fix your problem. I think it still uses glibc 2.23. I don't use mint anymore because it's based on Ubuntu LTS so it's often old. – Z boson Feb 19 '17 at 14:52
  • Yes, you are right. I used to run Ubuntu on my laptop. I bought this new laptop and thought it might be a better idea to run a new OS. I liked Linux Mint because I had no idea. I might change to Ubuntu 17.04 when it's released. – Amiri Feb 19 '17 at 14:59
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/136081/discussion-on-question-by-facked-developer-what-is-the-benefits-of-using-vaddss). – Bhargav Rao Feb 19 '17 at 15:07
  • Sorry to belabor this, but I just realized one solution is to only compile with AVX and not worry about non-VEX encoding. You can't really test SSE-only code on your system because you don't have a system with SSE only. You could try `-mprefer-avx128` if you want to compare 128-bit and 256-bit operations. The problem with using `__asm__ __volatile__ ( "vzeroupper" : : : );` is that it would crash on a system without AVX. That's why GCC won't let you do it except with asm. If you use that instruction you might as well compile with `-mavx`. – Z boson Feb 19 '17 at 15:22
  • Can you do one call to `clock_gettime` outside of the do loop and then the asm only once after this? My idea is that the first time you call `clock_gettime` it does some initialization which makes the upper half dirty, but subsequent calls don't do this. – Z boson Feb 19 '17 at 15:44
  • @Zboson, yes, you are right. Outside the loop works, and the results showed non-VEX is 1.25x faster than VEX encoding. – Amiri Feb 19 '17 at 17:05
  • @FackedDeveloper, okay we have narrowed it down. My guess is the first time you call a glibc function it calls some init code which in this case checks CPUID for AVX and then dirties the upper half. Before Skylake this was not a problem. If I am right you could do a `printf` right after `main` and then use `__asm__ __volatile__ ( "vzeroupper" : : : );` right after the `printf`. Can you check this? – Z boson Feb 19 '17 at 19:49
  • @Zboson, `printf("hello"); __asm__ __volatile__ ( "vzeroupper" : : : );` right after main does not work – Amiri Feb 19 '17 at 20:04
  • @FackedDeveloper, okay, thanks for checking. I was just thinking that might not work because `clock_gettime` may use another dynamic library (some require `rt`). What does `ldd` show for your executable? It should list the libraries linked to it. The library that calls `clock_gettime` might not get loaded (and inited) until the first time it's called. – Z boson Feb 19 '17 at 20:14
  • `ldd` for your executable on my system shows `linux-vdso.so.1`, `/lib/x86_64-linux-gnu/libc.so.6` and `/lib64/ld-linux-x86-64.so.2`. – Z boson Feb 19 '17 at 20:16
  • `clock_gettime` seems to be in `librt` (`/lib/x86_64-linux-gnu/librt-2.24.so` on my system). – Z boson Feb 19 '17 at 20:27
  • `ldd` showed `linux-vdso.so.1 => (0x00007ffd2efc1000)`, `libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efcc3a6d000)`, and `/lib64/ld-linux-x86-64.so.2 (0x00005558c4985000)` – Amiri Feb 19 '17 at 20:29
  • @FackedDeveloper, okay, this is beyond my skills now. `objdump -T a.out` on my system shows `GLIBC_2.17 clock_gettime`. I am not sure how to trace down how `clock_gettime` is loaded. – Z boson Feb 19 '17 at 20:32
  • From [here](http://juliusdavies.ca/posix_clocks/clock_realtime_linux_faq.html): the app calls glibc's clock_gettime(); glibc makes the syscall into the kernel; the kernel calls sys_clock_gettime(), which calls getnstimeofday(); getnstimeofday() returns the combination of xtime and __get_nsec_offset(); __get_nsec_offset() reads the clocksource hardware (in this case the TSC) and returns the time since xtime was last updated (it's a little more complicated than this, but fundamentally that's what goes on). – Amiri Feb 19 '17 at 20:34
  • @FackedDeveloper, thanks! (though that link is about eight years old). The point is that some initialization code is being called the first time you call `clock_gettime()`. Maybe it loads `librt` the first time. I am not sure. In any case it appears at this stage that the upper half of the AVX registers gets dirty. The only reason I care (why I am still commenting) is because somebody said "Seems like the bug is somewhere else" which annoys me. – Z boson Feb 19 '17 at 20:49
  • @Zboson, that's OK, don't be annoyed. That somebody might know many things or not. But the best part is that you found the problem that somebody could not. Once the problem is solved, everyone can say "yeah, I knew that"; while there was no answer, that somebody was silent. – Amiri Feb 19 '17 at 21:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/136099/discussion-between-facked-developer-and-z-boson). – Amiri Feb 19 '17 at 21:36
  • Do you know how to use `gdb`? I think you can [trace the problem](http://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake?noredirect=1#comment71857133_41303780) by putting a breakpoint right before the first call to `clock_gettime()` and then stepping through until the AVX register gets dirty. I must admit that I have not used gdb in ages... – Z boson Feb 21 '17 at 08:16

1 Answer


The problem you are observing is explained here. On Skylake systems, if the upper half of an AVX register is dirty, then there is a false dependency for non-VEX-encoded SSE operations on the upper half of the AVX register. In your case it seems there is a bug in your version of glibc 2.23. On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use

__asm__ __volatile__ ( "vzeroupper" : : : ); 

to clean the upper halves of the AVX registers. I don't think you can use an intrinsic such as `_mm256_zeroupper()` to fix this, because GCC treats it as SSE code and will not recognize the intrinsic. The option `-mvzeroupper` won't work either, because GCC once again thinks it's SSE code and will not emit the `vzeroupper` instruction.
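Based on the discussion in the comments, here is a minimal sketch of the workaround that worked in this case (the variable name `tInit` is mine; the rest reuses the code from the question). Call `clock_gettime` once before the timed loop so glibc's lazy initialization runs, then clear the upper halves once:

struct timespec tInit;
clock_gettime(CLOCK_MONOTONIC, &tInit);      // the first call appears to run glibc's lazy
                                             // init, which dirties the upper AVX halves
__asm__ __volatile__ ( "vzeroupper" : : : ); // clean them once, up front

do {
    clock_gettime(CLOCK_MONOTONIC, &tStart);
    /* ... timed kernel ... */
    clock_gettime(CLOCK_MONOTONIC, &tEnd);
    /* ... timing bookkeeping ... */
} while (w++ < NUM_LOOP);

Keep in mind that `vzeroupper` would crash on a CPU without AVX, so if you use it you might as well compile with `-mavx` anyway.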

BTW, it's Microsoft's fault that the hardware has this problem.


Update:

Other people are apparently encountering this problem on Skylake. It has been observed after `printf`, `memset`, and `clock_gettime`.

If your goal is to compare 128-bit operations with 256-bit operations, you could consider using `-mprefer-avx128 -mavx` (which is particularly useful on AMD). But then you would be comparing AVX256 vs AVX128 and not AVX256 vs SSE. AVX128 and SSE both use 128-bit operations, but their implementations are different. If you benchmark, you should mention which one you used.
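For instance, the 128-bit side of that comparison needs no source changes at all; this sketch is just the loop from the question, and only the compiler flags change the encoding:

/* Compiled with -msse4.2 this emits legacy-encoded addps; compiled with
   -mavx the identical intrinsics emit VEX-encoded vaddps xmm (AVX128),
   keeping the 128-bit vs 256-bit comparison apples to apples. */
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j += 4) {
        __m128 v = _mm_add_ps(_mm_load_ps(&A[i][j]), _mm_load_ps(&B[i][j]));
        _mm_store_ps(&C[i][j], v);
    }
}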

  • According to the ABI, every function that uses AVX should execute `vzeroupper` when it's done. Seems like the bug is somewhere else. – fuz Feb 19 '17 at 15:27
  • @fuz, did you read the first link I pointed to? The problem goes away when clearing the upper part of the AVX register. I can't reproduce the problem on my system so I can't test it. The OP said the problem did not go away with `__asm__ __volatile__ ( "vzeroupper" : : : );` right after `main`, which is what I would have expected, but it does go away when it's used after `clock_gettime`. In my answer I did not mention this because the only thing I am fairly certain about is that the problem is the upper half being dirty. Can we agree on that? – Z boson Feb 19 '17 at 15:33
  • Read the last few lines of the post you linked; it says basically the same thing I said: someone must have used AVX instructions without executing `vzeroupper` afterwards. – fuz Feb 19 '17 at 15:44
  • @fuz, in that link the bug was in `_dl_runtime_resolve_avx(), /lib64/ld-linux-x86-64.so.2` – Z boson Feb 19 '17 at 15:51