
I am compiling the same benchmark with gcc -O2 -march=native. The interesting thing is that when I look at the objdump output, it actually contains instructions like vxorpd, which I thought should only appear when -ftree-vectorize is enabled (and -O2 should not enable that by default?). If I add the -m32 flag to compile for 32-bit, these packed instructions disappear. Has anyone met a similar situation and could give an explanation? Thanks.


2 Answers


XORPD is the classic SSE2 instruction that performs a bitwise logical XOR on two packed double-precision floating-point values.

VXORPD is the vector-extension version of that same instruction. Essentially, it is the classic SSE2 XORPD instruction with a VEX prefix; that is what the "V" in the mnemonic means. It was introduced with AVX (Advanced Vector Extensions), and is supported on any architecture that supports AVX. (There are actually two versions: the VEX.128-encoded version that works on the 128-bit XMM registers, and the VEX.256-encoded version that works on the 256-bit YMM registers. Note that the 256-bit form only requires AVX, not AVX2.)
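
As an illustration (this snippet and the compile flags are mine, not from the original post), both encodings can be produced from intrinsics. Compiled with something like gcc -O2 -mavx -S, the two functions below typically come out as vxorpd on XMM and YMM registers, respectively:

#include <immintrin.h>

/* 128-bit lanes: typically emits vxorpd xmm, xmm, xmm under -mavx */
__m128d Zero128(__m128d x)
{
   return _mm_xor_pd(x, x);
}

/* 256-bit lanes: typically emits vxorpd ymm, ymm, ymm under -mavx */
__m256d Zero256(__m256d x)
{
   return _mm256_xor_pd(x, x);
}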

All of the legacy SSE and SSE2 instructions can have a VEX prefix added to them, giving them a three-operand form and allowing them to interact and schedule more efficiently with the other new AVX instructions. Using the VEX encoding also avoids the costly transitions between VEX and non-VEX modes. Otherwise, these new encodings retain identical behavior. As such, compilers will typically generate VEX-prefixed versions of these instructions whenever the target architecture supports them. Clearly, in your case, -march=native is specifying an architecture that supports, at a minimum, AVX.
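
To make the three-operand point concrete, here is a hypothetical side-by-side (hand-written for illustration, not compiler output):

        xorpd   xmm0, xmm1          ; legacy SSE2: xmm0 = xmm0 XOR xmm1 (destination is also a source)
        vxorpd  xmm0, xmm1, xmm2    ; VEX-encoded: xmm0 = xmm1 XOR xmm2 (both sources preserved)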

On GCC and Clang, you will actually get these instructions emitted even with optimization turned off (-O0), so you will certainly get them when optimizations are enabled. Neither the -ftree-vectorize switch nor any of the other vectorization-specific optimization switches needs to be on, because this doesn't actually have anything to do with vectorizing your code. The code flow hasn't changed; only the encoding of the instructions has.

You can see this with the simplest code imaginable:

double Foo()
{
   return 0.0;
}
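
Compiled for 64-bit with something along the lines of gcc -O2 -march=native -S -masm=intel on an AVX-capable machine (the original post doesn't show the exact command, so treat this invocation as illustrative), you get: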
Foo():
        vxorpd  xmm0, xmm0, xmm0
        ret

So that explains why you're seeing VXORPD and its friends when you compile a 64-bit build with the -march=native switch.

That leaves the question of why you don't see it when you throw the -m32 switch (which tells the compiler to generate code for 32-bit targets). SSE and AVX instructions are still available when targeting these platforms, and I believe they will be used under certain circumstances, but they cannot be used quite as frequently because of significant differences in the 32-bit ABI. Specifically, the 32-bit ABI requires that floating-point values be returned on the x87 floating-point stack. Since that requires the use of the x87 floating-point instructions, the optimizer tends to stick with those unless it is heavily vectorizing a section of code. That's the only time it really makes sense to shuffle values from the x87 stack to SIMD registers and back again; otherwise, the shuffling is a performance drain for little to no practical benefit.

You can see this in action, too. Look at what changes in the output just from adding the -m32 switch:

Foo():
        fldz
        ret

FLDZ is the x87 FPU instruction that loads the constant zero onto the top of the floating-point stack, where it is ready to be returned to the caller.
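
If you want to see the ABI constraint in isolation, you can ask the 32-bit compiler to do its arithmetic in SSE registers with -mfpmath=sse and watch the result still get bounced through the x87 stack on the way out. This is a sketch of what gcc typically emits for the function below with -m32 -msse2 -mfpmath=sse -O2 (the exact stack offsets will vary):

double Add(double a, double b)
{
   return a + b;
}
Add:
        sub     esp, 12
        movsd   xmm0, QWORD PTR [esp+16]   ; load a into an SSE register
        addsd   xmm0, QWORD PTR [esp+24]   ; compute a + b in SSE
        movsd   QWORD PTR [esp], xmm0      ; spill the result to memory...
        fld     QWORD PTR [esp]            ; ...and reload it onto the x87 stack to return
        add     esp, 12
        ret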

Obviously, as you make the code more complicated, you are more likely to trip the optimizer's heuristics and persuade it to emit SIMD instructions, and far more likely still if you enable vectorization-based optimizations.

Cody Gray - on strike
  • Hi Cody Gray, thanks for your reply. Just one more question to follow up: I saw a big speedup from these VEX-prefixed instructions when compiling with fast-math. Do you have any idea why these VEX-prefixed instructions perform better than the original x87 instructions? Thanks – PST Jun 02 '16 at 10:27
  • @PST Well, the regular SSE instructions tend to be faster than the x87 instructions. There are multiple complicated reasons for that. One of the most significant is that the x87 FPU works off of a stack-based design, with all of the attendant limitations, whereas the SSE implementation uses registers. That means no time is wasted pushing/popping values on the stack, or exchanging values at different positions on the stack. Another reason that SSE is faster than x87 is simply that it is a newer implementation and has been optimized accordingly. – Cody Gray - on strike Jun 07 '16 at 05:42
  • Then, my answer already explains why the VEX-prefixed SSE instructions are faster than regular SSE instructions. So you are getting the benefit of two performance improvements: first switching from x87 to SSE, and then switching from SSE to VEX-encoded SSE. The Intel engineers must have been up to something these past 15-20 years. :-) – Cody Gray - on strike Jun 07 '16 at 05:44

Just to add to Cody Gray's very good answer: you can check which options gcc has enabled internally by compiling to assembly with -fverbose-asm turned on.

For example:

gcc -O2 -fverbose-asm -S -o test.S test.c

will produce test.S with comments listing all optimization options enabled at the chosen optimization level (here -O2).
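
As a related trick (not part of the original answer), gcc can also print the optimizer flags directly, without producing an assembly listing at all:

gcc -Q --help=optimizers -O2

which prints each -f... optimization option together with its [enabled]/[disabled] state at that level.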

Rainer Keller