When compiling for CPUs that have AVX (such as with -march=sandybridge), GCC seems to always prefer the AVX versions of simple, scalar floating-point instructions over the SSE versions. For example, it uses vmulsd instead of mulsd.
Is there a performance-related reason for this, or is it just an implementation detail of GCC that makes it easier or more natural for it to schedule such instructions? From the sources I have (mostly Agner Fog's instruction tables), the AVX and SSE versions appear to perform identically. I realize that the AVX instructions are three-operand, but GCC almost always uses the destination register as one of the source operands anyway.
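
To make the observation concrete, here is a minimal sketch of the kind of code I mean. The file name mul.c and the function scale are made up for illustration, and the exact assembly will depend on the GCC version and flags, so treat the comments as what I'd expect rather than guaranteed output.

```c
/* mul.c: hypothetical test case, not taken from a real project.
 *
 * Build to assembly and inspect the result, for example:
 *   gcc -O2 -march=sandybridge -S mul.c    (AVX enabled)
 *   gcc -O2 -S mul.c                       (baseline x86-64, SSE2 only)
 */
double scale(double x, double factor)
{
    /* A single scalar double multiply. With AVX enabled I'd expect the
     * VEX-encoded vmulsd here; without it, the legacy SSE2 mulsd. */
    return x * factor;
}
```

In the AVX build, the multiply I see is of the form vmulsd with the destination register repeated as one of the sources, which is why the three-operand form doesn't look like it buys anything in this case.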