
I've come across a fast CRC implementation that uses PCLMULQDQ. I see that the authors mix pxor and xorps instructions heavily, as in the fragment below:

movdqa  xmm10, [rk9]
movdqa  xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm0

movdqa  xmm10, [rk11]
movdqa  xmm8, xmm1
pclmulqdq xmm1, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm1

Is there any practical reason for this? A performance boost? If so, what lies beneath it? Or is it just a sort of coding style, for fun?

Alexander Zhak
  • `xorps` is a three-byte instruction, while `pxor` takes four bytes. Other than that, Agner Fog's [instruction tables](http://www.agner.org/optimize/instruction_tables.pdf) and [microarchitecture manuals](http://www.agner.org/optimize/microarchitecture.pdf) indicate that it doesn't hurt on AMD, since `xorps` is treated as integer-domain there. This *could* hurt performance on pre-Skylake Intel though, as `xorps` can't use as many execution units there, and there may be bypass delays. – EOF Oct 03 '16 at 08:39
  • @EOF: I'm guessing it's tuned for Intel SnB/IvB, based on the date (and that it's written by Intel). Alignment for the uop cache seems like the best guess, but maybe there's something going on with avoiding a resource conflict to not delay the next PCLMUL. – Peter Cordes Oct 03 '16 at 10:00

1 Answer


TL:DR: it looks like maybe some microarch-specific tuning for this specific code sequence. There's nothing "generally recommended" about it that will help in other cases.

On further consideration, I think @Iwillnotexist Idonotexist's theory is the most likely: this was written by a non-expert who thought it might help. The register allocation is a big clue: many REX prefixes could have been avoided by keeping all the repeatedly-used registers in the low 8 (xmm0..xmm7).
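
To see the encoding cost of that register choice, compare the machine-code bytes (these follow from the standard x86-64 encoding rules; the operands are just illustrative):

pxor  xmm7, xmm0    ; 66 0F EF F8     = 4 bytes
pxor  xmm7, xmm8    ; 66 41 0F EF F8  = 5 bytes (REX.B prefix needed for xmm8..xmm15)
xorps xmm7, xmm0    ; 0F 57 F8        = 3 bytes (no 66 operand-size prefix)
xorps xmm7, xmm8    ; 41 0F 57 F8     = 4 bytes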


XORPS runs in the "float" domain on some Intel CPUs (Nehalem and later), while PXOR always runs in the "ivec" (vector-integer) domain.

Since wiring every ALU output to every ALU input for direct forwarding of results would be expensive, CPU designers break the ALUs up into domains. (Forwarding saves the latency of writing back to the register file and re-reading.) Crossing a domain can cost an extra 1 cycle of latency (Intel SnB-family) or 2 cycles (Nehalem).

Further reading: my answer on What's the difference between logical SSE intrinsics?
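
As a minimal illustration of the crossing in question (a hypothetical back-to-back pair, mirroring the question's fragment):

pxor  xmm7, xmm8    ; executes in the integer ("ivec") domain
xorps xmm7, xmm0    ; float-domain boolean consuming ivec results:
                    ; 2c bypass delay on Nehalem; per Agner Fog, often free on SnB-family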


Two theories occur to me:

  • Whoever wrote this thought that PXOR and XORPS would give more parallelism, because they don't compete with each other. (This is wrong: PXOR can run on all vector ALU ports, but XORPS can't).

  • This is some very cleverly tuned code that creates a bypass delay on purpose, to avoid a resource conflict that might delay the execution of the next PCLMULQDQ. (Or, as EOF suggests, code-size / alignment might have something to do with it.)

The copyright notice on the code says "2011-2015 Intel", so it's worth considering the possibility that it's somehow helpful for some recent Intel CPU, and isn't just based on a misunderstanding of how Intel CPUs work. Nehalem was the first CPU to include PCLMULQDQ at all, and this is Intel so if anything it'll be tuned to do badly on AMD CPUs. The code history isn't in the git repo, only the May 6th commit that added the current version.

The Intel whitepaper (from Dec 2009) that it's based on used PXOR only, not XORPS, in its version of the 2x pclmul / 2x xor block.
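
For comparison, a PXOR-only version of the block from the question would look like this (a sketch; the register numbers follow the question's fragment, not the whitepaper's listing):

movdqa    xmm10, [rk9]
movdqa    xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11   ; multiply the high 64-bit halves
pclmulqdq xmm8, xmm10, 0x0    ; multiply the low 64-bit halves
pxor      xmm7, xmm8
pxor      xmm7, xmm0          ; pxor instead of xorps: stays in the ivec domain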

Agner Fog's tables don't even show the uop count for PCLMULQDQ on Nehalem, or which ports it requires. It's 12c latency, and one per 8c throughput, so it might be similar to Sandybridge/Ivybridge's 18-uop implementation. Haswell cuts it to an impressive 3 uops (2p0 p5), and it's a single uop on Broadwell (p0) and Skylake (p5).

XORPS can only run on port5 (until Skylake, where it also runs on all three vector ALU ports). On Nehalem, it has a 2c bypass delay when one of its inputs comes from PXOR. On SnB-family CPUs, Agner Fog says:

In some cases, there is no bypass delay when using the wrong type of shuffle or Boolean instruction.

So I think there's actually no extra bypass delay for forwarding from PXOR -> XORPS on SnB; the only effect would be that XORPS can only run on port 5. On Nehalem, it might actually delay the XORPS until after the PSHUFBs were done.

In the main unrolled loop, there's a PSHUFB after the XORs to set up the input for the next PCLMUL. SnB/IvB can run integer shuffles on p1/p5 (unlike Haswell and later, which have only one shuffle unit, on p5; it is 256b wide for AVX2, though).
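
A sketch of where that shuffle sits (the pointer register, mask register, and register numbers are my assumptions, not taken from the actual source):

movdqu xmm9, [rsi]    ; load the next 16 bytes of input (assumed pointer in rsi)
pshufb xmm9, xmm11    ; byte-reflect them (assumed shuffle mask in xmm11); p1/p5 on SnB/IvB
pxor   xmm7, xmm9     ; combine with the folded value: this feeds the next PCLMULQDQ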

Since competing for the ports needed to set up the input for the next PCLMUL doesn't seem useful, my best guess is code size / alignment if this change was done when tuning for SnB.


On CPUs where PCLMULQDQ is more than 4 uops, it's microcoded, so each PCLMULQDQ requires an entire uop cache line to itself. Each line of the uop cache can only cache contiguous instructions, and only 3 uop cache lines can map to the same aligned 32B block of x86 instructions, so much of this code won't fit in the uop cache at all on SnB/IvB. From Intel's optimization manual:

All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.

This sounds like a very similar issue to having integer DIV in a loop: Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs. With the right alignment, you can get it to run out of the uop cache (the DSB in Intel performance-counter terminology). @Iwillnotexist Idonotexist did some useful testing of micro-coded instructions on a Haswell CPU, showing that they prevent running from the loopback buffer (the LSD in Intel terminology).
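
A sketch of the kind of layout tweak this implies (NASM syntax, purely illustrative; the label and surrounding structure are not from the actual source):

align 32                        ; place the fold block at a 32B boundary, so each
.fold_loop:                     ; microcoded PCLMULQDQ lands in a predictable 32B region
movdqa    xmm10, [rk9]
movdqa    xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11     ; microcoded on SnB/IvB: needs a uop cache line to itself
pclmulqdq xmm8, xmm10, 0x0
pxor      xmm7, xmm8
xorps     xmm7, xmm0
; ... remaining fold blocks, pointer increment, and loop branch ...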


On Haswell and later, PCLMULQDQ is not microcoded, so it can go in the same uop cache line with other instructions before or after it.

For previous CPUs, it might be worth trying to tweak the code to bust the uop cache in fewer places. OTOH, switching between uop cache and legacy decoders might be worse than just always running from the decoders.

Also IDK if such a big unroll is really helpful. It probably varies a lot between SnB and Skylake, since microcoded instructions are very different for the pipeline, and SKL might not even bottleneck on PCLMUL throughput.

Peter Cordes
  • This code was written for Westmere, as the PDF itself claims. I happen to think that it was probably the PhD student who wrote it, and that he did **not** know precisely what he was doing. Evidence: 1. Random use of `pxor/xorps` instead of `pxor/pxor`. 2. No use of `mov[au]ps` for memory loads. 3. Awful regalloc, esp. of `xmm10`, increasing nearly all insn sizes by 1 byte. 4. `pclmulqdq` takes 18 uops on Westmere, best-case throughput is 1 every 8c, and encodes w/ prefixes to 7 bytes, so micro-optimizations like alignment and port scheduling are very premature. `pxor/xorps` here is cargo cult. – Iwillnotexist Idonotexist Oct 06 '16 at 01:22
  • @IwillnotexistIdonotexist: hrm, yeah I noticed some questionable register allocation, too. But the code was copyright 2011-2015, while the PDF was published in 2009. The PDF makes no reference to this implementation of the code, which is why it's plausible that this was written with a later CPU in mind. Esp. since we're only seeing the 2015 version, not even the 2011 version. But yes, it doesn't look like good code. It's still possible that this somehow helps on some CPU, but I think your theory is probably more likely, and it's just crap. – Peter Cordes Oct 06 '16 at 01:34
  • I basically wrote this answer as a thinking-out-loud brain-dump while entertaining the possibility that this wasn't just stupid. – Peter Cordes Oct 06 '16 at 01:34
  • But damn do I like reading your answers. You practically own the x86 micro-optimizations territory on SO. I wonder how you have such a flair for finding good x86 asm questions! – Iwillnotexist Idonotexist Oct 06 '16 at 01:37
  • @IwillnotexistIdonotexist: My question-feed is mostly just assembly / x86 / sse / avx / computer-architecture / lock-free / stdatomic, so I see them all and answer the good ones. There are lots of people writing good C / C++ answers, and I'd never be able to keep up with the volume of questions in those tags. (Plus, when I do look at micro-optimization C/C++ questions, writing up an answer usually takes a really long time, comparing the asm for all the random good and bad ideas from other answers. So I limit myself to a question feed I can keep up with, because I can't *not* look at things.) – Peter Cordes Oct 06 '16 at 01:46
  • Although I am starting to train myself to just move on from the really boring newbie asm questions asking a minor variation on the same question for the hundredth time, and the stupid walls of 16-bit DOS code with 5 different bugs. It boggles my mind why people think it's ok to bother others with their problems when they haven't even used a debugger. – Peter Cordes Oct 06 '16 at 01:53
  • @PeterCordes, does this https://godbolt.org/z/PYcWK1rMs look like a missed optimization of MSVC to emit `xorps` for no reason, since `punpcklbw` is integer domain? – Alex Guteniev Sep 10 '21 at 13:49
  • @AlexGuteniev: For Nehalem, perhaps. For current CPUs, no. xor-zeroing is special anyway, although on AMD CPUs it does still need an execution port to write a zero. SnB-family can forward efficiently from FP booleans to SIMD-integer, regardless of which execution port the boolean ran on (e.g. on Skylake, where they're not limited to port 5). It's not "for no reason"; `xorps` saves code size. (Or at least it's not for no *benefit*. IDK why MSVC decided to do that.) – Peter Cordes Sep 10 '21 at 14:26