
vextracti128 and vextractf128 have the same functionality, parameters, and return values, yet one belongs to the AVX instruction set while the other belongs to AVX2. What is the difference?

user2813757
  • VS2012 C++ has a bug: when compiling a vextracti128 instruction it occasionally swaps two registers. It compiles the vextractf128 instruction correctly. VS2013 C++ seems to get it right. – user2813757 Nov 21 '13 at 06:00

2 Answers


vextracti128 and vextractf128 have not only the same functionality, parameters, and return values; they also have the same instruction length and the same throughput (according to Agner Fog's optimization manuals).

What is not completely clear is their latency (performance in tight loops with dependency chains). The latency of the instructions themselves is 3 cycles. But after reading section 2.1.3 ("Execution Engine") of the Intel Optimization Manual we might suspect that vextracti128 gets an additional 1-clock delay when working with floating-point data and vextractf128 gets an additional 1-clock delay when working with integer data. Measurements show that this is not true and the latency always remains exactly 3 cycles (at least on Haswell processors). And as far as I know this is not documented anywhere in the Optimization Manual.
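
For reference, here is a sketch of the kind of dependency-chain loop such a latency measurement can use (an illustration only, not necessarily the exact harness behind the numbers above). The loop-carried chain goes through both the extract and the insert, so the cycles per iteration approximate the sum of the two latencies; swapping in vextracti128/vinserti128 gives the integer-domain variant for comparison:

    ; illustrative latency test: extract feeds insert, insert feeds the next extract
_latloop:
    vextractf128 xmm1, ymm0, 1          ; xmm1 depends on ymm0
    vinsertf128  ymm0, ymm0, xmm1, 1    ; ymm0 depends on xmm1 (loop-carried chain)
    jmp _latloop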

Still, the instruction set is only an interface to the processor. Haswell is (for now) the only implementation of this interface that contains both of these instructions. We could ignore the fact that the implementations are (most likely) identical and simply use the instructions as intended: vextracti128 for integer data and vextractf128 for FP data. (If we only need to reorder data without performing any int/FP operations, the obvious choice is vextractf128, as it is supported by several older processors.) Experience also shows that Intel sometimes decreases the performance of some instructions in later generations of CPUs, so it would be wise to respect each instruction's domain affinity to avoid any possible speed degradation in the future.
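
As a minimal (purely illustrative) sketch of the intended usage, here is the same horizontal-reduction step written for each data type, folding the upper 128-bit lane into the lower one:

    ; FP data: AVX is enough
    vextractf128 xmm1, ymm0, 1          ; upper lane of the packed floats
    vaddps       xmm0, xmm0, xmm1       ; fold into the lower lane

    ; integer data: the AVX2 form
    vextracti128 xmm3, ymm2, 1          ; upper lane of the packed dwords
    vpaddd       xmm2, xmm2, xmm3       ; fold into the lower lane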

Since the Intel Optimization Manual is not very detailed in describing the relationship between the int and FP domains for SIMD instructions, I've made some more measurements (on Haswell) and got some interesting results:


Shuffle instructions

There is no additional delay for transitions between SSE integer and shuffle instructions, and no additional delay for transitions between SSE FP and shuffle instructions (though I didn't test every instruction). For example, you can insert an "obviously integer" instruction such as pshufb between two FP instructions with no extra delay. Inserting shufpd in the middle of integer code also adds no extra delay.

Since vextracti128 and vextractf128 are executed by the shuffle unit, they also have this "no delay" property.

This may be useful for optimizing mixed int+FP code. If you need to reinterpret FP data as integers and shuffle the register at the same time, just make sure all FP instructions come before the shuffle and all integer instructions come after it.
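
A minimal sketch of that ordering (the particular instructions are chosen for illustration only):

    vmulps  ymm0, ymm0, ymm1            ; FP work first
    vpshufd ymm0, ymm0, 0xB1            ; shuffle on the int/FP boundary: no bypass delay either way
    vpaddd  ymm0, ymm0, ymm2            ; integer work after the shuffle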


FP logical instructions

andps and other FP logical instructions also have the property of ignoring FP/int domains.

If you add an integer logical instruction (like pand) into FP code, you get an additional 2-cycle delay (one cycle to get into the int domain and another to get back to FP). So the obvious choice for SIMD FP code is andps. The same andps may be used in the middle of integer code without any delay. Even better is to place such instructions right on the boundary between int and FP instructions, as sketched below. Interestingly, FP logical instructions use the same port (port 5) as all the shuffle instructions.
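
An illustrative sketch of that boundary placement:

    vaddps  ymm0, ymm0, ymm1            ; FP domain
    vandps  ymm0, ymm0, ymm2            ; FP logical: no bypass delay toward either neighbor
    vpaddd  ymm0, ymm0, ymm3            ; integer domain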


Register access

The Intel Optimization Manual describes bypass delays between producer and consumer micro-ops, but it does not say anything about how micro-ops interact with registers.

This piece of code needs only 3 clocks per iteration (just as required by vaddps):

    vxorps ymm7, ymm7, ymm7
_benchloop:
    vaddps ymm0, ymm0, ymm7
    jmp _benchloop

But this one needs 2 clocks per iteration (1 more than needed for vpaddd):

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpaddd ymm0, ymm0, ymm7
    jmp _benchloop

The only difference here is that the calculations are in the integer domain instead of the FP domain. To get 1 clock/iteration we need to add an instruction:

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpand ymm6, ymm7, ymm7
    vpaddd ymm0, ymm0, ymm6
    jmp _benchloop

This hints that (1) all values stored in SIMD registers belong to the FP domain, and (2) reading from a SIMD register increases an integer operation's latency by one cycle. (The difference between {ymm0, ymm6} and ymm7 here is that ymm7 is stored in some scratch storage and acts as a real "register", while ymm0 and ymm6 are temporary and are represented by the state of the CPU's internal interconnections rather than by permanent storage, so ymm0 and ymm6 are not "read" but just forwarded between micro-ops).

Evgeny Kluev
  • Sandybridge's register file redesign removed ROB read port stalls. That advice to keep values live in the forwarding network is now obsolete. The loop can only run 1 iteration per 2 cycles because taken branches have a throughput of one per 2 cycles. (I tested on SnB, with Linux `perf` for counting insns/cycle. Agner Fog lists `jmp near`'s recip throughput as 1-2c for Haswell, vs. 2c for SnB, so maybe you can get one cycle per iteration on HSW.) But I think if either loop is 1c, both non-FP loops are. Otherwise nice answer. – Peter Cordes Jul 17 '15 at 04:40
  • Update on my previous comment: I'm pretty sure SnB can run tiny loops at one per clock (just like most CPUs), despite what Agner Fog's table says for jump-throughput numbers. Probably that one per 2 clock throughput only applies to non-loop branches. See also [this Q&A about loops with an odd number of uops issuing at less than 4 uops per clock from the LSD (loop buffer)](http://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of). Execution can of course be limited by the details of the loop body, like in this case. – Peter Cordes Apr 09 '17 at 02:21
  • I tried to reproduce your results on Haswell. The VPAND+VPADDD consistently runs at 1 iter/clock. The just-VPADDD version sometimes does and sometimes doesn't; if you get lucky, it runs exactly as fast as the two-instruction version. But other times, it runs at 1 iter per ~1.5 cycles, or 1 per ~2 cycles. I didn't see any other patterns. (I don't have easy access to a Haswell with perf counters, so I modified the loop to use `dec ecx/jnz` (which micro-fuses into a single uop just like `jmp`. I looked at times to run 2 billion iterations, with repeated trials since I'm on a noisy VM.) – Peter Cordes Apr 09 '17 at 02:48
  • Initializing `ymm7` with something other than xor-zeroing doesn't seem to change things, either. (I tried `vcmpeqw same,same` to get all-ones, since [xor-zeroing is special](http://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and)). It's really puzzling, since your `vpand` still has to read `ymm7` from the physical register file, not the forwarding network. It's not part of the loop-carried dependency chain, though, which is the obvious difference. Changing VPADDD to a boolean doesn't matter. – Peter Cordes Apr 09 '17 at 02:57
  • BTW, **I *can't* reproduce this on Skylake**. Multiple runs of `perf stat` on your original loop (with `jmp`) and on my 2-billion iterations version shows it runs at exactly 1 iteration per clock every time, to within 0.01% **Whatever sub-optimal pattern Haswell sometimes falls into, Skylake doesn't have that problem.** – Peter Cordes Apr 09 '17 at 03:01

Good question - it looks like the AVX instruction vextractf128 is intended for any vector type (int, float, double), while the AVX2 instruction vextracti128 is intended for integer vectors only. I recommend using the latter when you have AVX2 and integer vectors, in case it offers better performance in some situations; otherwise use the former.
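
For example (purely illustrative), the two forms take exactly the same operands; only the intended data domain differs:

    vextractf128 xmm1, ymm0, 1          ; AVX:  upper 128 bits, any element type
    vextracti128 xmm1, ymm0, 1          ; AVX2: same operation, meant for integer data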

Paul R
  • Thank you for your answer. I did not find latency and throughput data for vextracti128. For now I assume the two are functionally identical, so I use vextractf128 because it is compatible with older CPUs; vextracti128 is only available on Haswell and later. The latest Intel (July 2013) 64-ia-32-architectures-optimization-manual.pdf has no latency or throughput data for vextracti128. If vextracti128 ever gains a performance advantage for integer data I will use it, but not for now. Thanks. – user2813757 Oct 01 '13 at 12:22