
I had some C code written with Intel intrinsics. After compiling it first with AVX and then with SSSE3 flags, I got two quite different assembly outputs, e.g.:

AVX:

```
vpunpckhbw %xmm0, %xmm1, %xmm2
```

SSSE3:

```
movdqa %xmm0, %xmm2
punpckhbw %xmm1, %xmm2
```

It's clear that `vpunpckhbw` is just `punpckhbw` using the AVX three-operand syntax. But are the latency and throughput of the first instruction equivalent to the combined latency and throughput of the latter two? Or does the answer depend on the architecture I'm using? It's an Intel Core i5-6500, by the way.
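The original C source isn't shown in the question; as an illustration (the function name is invented), an intrinsics snippet along these lines typically compiles to exactly this unpack instruction:

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_unpackhi_epi8 */

/* Interleaves the high 8 bytes of a and b.  When the first input
   must be preserved, an SSE build copies it first (movdqa) and then
   runs the destructive punpckhbw; an AVX build can emit a single
   non-destructive vpunpckhbw instead. */
__m128i unpack_high(__m128i a, __m128i b)
{
    return _mm_unpackhi_epi8(a, b);
}
```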

I tried to find an answer in Agner Fog's instruction tables but couldn't. Intel's documentation didn't help either (though it's likely that I just missed the entry I needed).

Is it always better to use the new AVX syntax if possible?

Artyom
  • Some days, don't you just long for the old days of *mov al,8* :-) – Neil Jul 04 '16 at 15:12
  • @Neil Oh yes, I long for the days when `rep movsb` was considered SIMD. – fuz Jul 04 '16 at 15:41
  • If you need to support AVX then the answer to your question is most likely moot, since there is a high performance penalty for switching between old-style (non-VEX) SSE instructions and VEX-encoded SSE/AVX instructions at run-time. Hence it's "all or nothing". – Paul R Jul 04 '16 at 16:04
  • Look for VEX here: [link](https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf). I am using the Intel Compiler with SSE intrinsics, and getting about 5% (average) performance improvement when AVX is enabled (SSE intrinsics are compiled to VEX instructions). – Rotem Jul 04 '16 at 16:12
  • @PaulR Nope, I don't need to mix legacy SSE code and AVX, so the penalty for switching between them is not really a problem. I'm just curious whether the VEX AVX syntax gives me any bonus performance. Otherwise, I'll just stay with SSE. – Artyom Jul 05 '16 at 06:58
  • @Rotem Thanks for the link. However, there is really no penalty in my case. I have two different object files, one with VEX syntax, the other with legacy SSE, and I'm using only one of them at a time. Are you sure it's the VEX instructions that give you the performance improvement? It could also be instructions that appeared only in AVX and have no analogs in legacy SSE, e.g. `vpinsrb`, which only appeared in SSE 4.1. – Artyom Jul 05 '16 at 07:00
  • Your CPU can only push 4 micro-fused micro-ops per clock cycle. In the AVX case the load can micro-fuse; in the SSE case it cannot. In a tight loop this could mean the SSE case needs five micro-ops where the AVX case needs only four fused micro-ops, which could have a big impact on performance. See [this](https://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62) for more details, or wait for Peter Cordes to respond with more and better details. – Z boson Jul 05 '16 at 08:59
  • I've seen some cases where just re-compiling legacy SSE code with `-mavx` gives a modest performance improvement, presumably due to the non-destructive VEX SSE instructions reducing the number of instructions required to do the same job. – Paul R Jul 05 '16 at 09:31
  • @Artyom I think Paul is right - the non-destructive VEX SSE instructions are probably the reason for the modest performance improvement. I am using SSE (up to) 4.1 intrinsics, and comparing compiler flag /QxSSE4.2 versus flag /QxAVX on a Core i7 2600. – Rotem Jul 05 '16 at 12:40
  • I misread your question. You're not doing a mem-to-reg but a reg-to-reg mov. I am not sure, but I think on more recent Intel processors this has zero latency. However, I think `movdqa %xmm0, %xmm2` still counts as an instruction, and only four instructions can be processed per cycle, so the AVX case could allow another instruction per cycle. On the other hand it could [in some cases be worse](https://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp) I think. – Z boson Jul 06 '16 at 06:49
  • Yeah, in [this answer](https://stackoverflow.com/questions/30719340/how-to-add-values-from-vector-to-each-other/31044910#31044910) it's written "On IvyBridge and later, the register-renaming stage handles reg-reg moves, and they happen with zero latency." So `movdqa %xmm0, %xmm2` has zero latency on an `i5-6500`, but it can still affect the throughput since it still counts as an instruction. – Z boson Jul 06 '16 at 07:57
  • Agner's instruction tables say the latency of `MOVDQA/U` for Skylake is 0-1 and the throughput is 4. That's consistent with what I am saying: the latency is zero, but you can still only push four per cycle. – Z boson Jul 06 '16 at 08:09
  • Agner says "An eliminated move has zero latency and does not use any execution port. But it does consume bandwidth in the decoders." So it does not use an execution port either, but it can still count as an instruction. See the section "Elimination of move instructions" in the SNB and IVB sections of Agner's microarchitecture manual. – Z boson Jul 06 '16 at 08:16

1 Answer


Is it always better to use the new AVX syntax if possible?

I think the first question to ask is whether a folded instruction is better than an unfolded instruction pair. Folding takes a move-and-modify pair like this

```
vmovdqa %xmm0, %xmm2
vpunpckhbw %xmm2, %xmm1, %xmm1
```

and "folds" them into one combined instruction

```
vpunpckhbw %xmm0, %xmm1, %xmm2
```

Since Ivy Bridge, a register-to-register move instruction can have zero latency and use no execution port. However, the unfolded instruction pair still counts as two instructions in the front-end and can therefore affect the overall throughput. The folded instruction counts as only one instruction in the front-end, which lowers the pressure on the front-end without any side effects. This can increase the overall throughput.
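To make the front-end argument concrete, here is a sketch (names invented, not code from the question) of a loop where the first input has to survive every iteration, so an SSE build needs a fresh `movdqa` per iteration while a VEX build writes straight to the destination register:

```c
#include <emmintrin.h>
#include <stddef.h>

/* key must stay intact across iterations.  With the destructive
   legacy-SSE encoding the compiler copies it into a scratch register
   (movdqa) before every punpckhbw; the three-operand VEX encoding
   needs no copy, saving one front-end slot per iteration. */
void unpack_all(__m128i key, const __m128i *src, __m128i *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = _mm_unpackhi_epi8(key, src[i]);
}
```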

However, for memory-to-register moves the folding may have a side effect (there is currently some debate about this) even though it lowers pressure on the front-end. The reason is that the out-of-order engine only sees the folded instruction from the front-end's point of view (assuming this answer is correct), so if it would for some reason be more optimal to reorder the memory read (which does require an execution port and has latency) independently of the other operations in the folded instruction, the out-of-order engine cannot take advantage of this. I observed this for the first time here.
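A sketch of the memory case (again with invented names): the explicit load below can either stay a separate `movdqu` or be folded into the ALU instruction as a memory operand, and once folded the front-end no longer presents it as an independent instruction:

```c
#include <emmintrin.h>

/* One load plus one ALU op.  Unfolded this is movdqu + (v)punpckhbw;
   folded (VEX allows unaligned memory operands) it is a single
   vpunpckhbw with a memory source, so the load is no longer a
   separately visible instruction to the out-of-order engine. */
__m128i unpack_from_mem(__m128i a, const void *p)
{
    __m128i b = _mm_loadu_si128((const __m128i *)p);
    return _mm_unpackhi_epi8(a, b);
}
```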

For your particular operation the AVX syntax is always better, since it folds away the register-to-register move. However, if you had a memory-to-register move, the folded AVX instruction could perform worse than the unfolded SSE instruction pair in some cases.


Note that, in general, it is still better to use VEX-encoded instructions. But I think most compilers, if not all, now assume folding is always better, so you have no way to control the folding except with assembly (not even with intrinsics), or in some cases by telling the compiler not to compile with AVX.

Z boson
  • Yes, it's always better to use the non-destructive destination feature of VEX encoding to avoid reg-reg `mov` instructions. I don't think "fold" is the right word for this, though: Thinking of it as actually combining a `movdqa` instruction with an ALU operation is the wrong mental picture, IMO. It's not like folding a load into an ALU instruction, since register-renaming means the result of `punpckhbw %xmm1, %xmm2` was already being written to a different physical register than either of the inputs. For terminology, "mov elimination" is already taken, too :/ – Peter Cordes Jul 06 '16 at 14:11
  • Not sure what you mean "on a Sandy Bridge Processor ... the [3 operand] instruction could perform worse", even in the reg-reg case. That's just wrong, unless the `movdqa` you're getting rid of took the right amount of space to align something later. Since `movdqa` still takes an execution port on SnB (not IvB or later), VEX encoding to avoid it is an even bigger win. – Peter Cordes Jul 06 '16 at 14:15
  • @PeterCordes, you're completely right about my comment on Sandy Bridge. I see your point on terminology, since folding implies micro-op fusion, which applies to memory reads/writes, but I think by analogy "folding" is fine. Folding with a reg-reg move is equivalent to folding a mem-reg move except it costs no micro-ops and uses no ports. – Z boson Jul 06 '16 at 15:22