
The vpclmulqdq instruction has four operands while pclmulqdq has three, so I thought vpclmulqdq could be used instead of movdqa + pclmulqdq, but in my experiments it turned out slower.

But when I use vpaddd instead of movdqa + paddd, I get a faster result, so I'm confused. The code uses paddd instructions like this:

movdqa %xmm0, %xmm8          # slower
movdqa %xmm0, %xmm9
movdqa %xmm0, %xmm10
movdqa %xmm0, %xmm11
paddd (ONE),  %xmm8
paddd (TWO),  %xmm9
paddd (THREE),  %xmm10
paddd (FOUR),  %xmm11

vpaddd (ONE), %xmm0, %xmm8   # faster
vpaddd (TWO), %xmm0, %xmm9
vpaddd (THREE), %xmm0, %xmm10
vpaddd (FOUR), %xmm0, %xmm11
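
(For reference, here is the same broadcast-and-add pattern in C intrinsics; the function name and constants are illustrative assumptions, not the original code. Compiled without AVX, the compiler must emit a movdqa copy plus a destructive paddd per result; compiled with -mavx it can use the non-destructive three-operand vpaddd, one fused-domain uop instead of two:)

#include <immintrin.h>

/* Hypothetical sketch of the counter pattern above: one base counter
 * plus four different increments.  Without AVX each add needs
 * movdqa + paddd (2 uops) because paddd overwrites its destination;
 * with -mavx the compiler can emit three-operand vpaddd (1 uop). */
void bump_counters(__m128i base, __m128i out[4])
{
    out[0] = _mm_add_epi32(base, _mm_set1_epi32(1));
    out[1] = _mm_add_epi32(base, _mm_set1_epi32(2));
    out[2] = _mm_add_epi32(base, _mm_set1_epi32(3));
    out[3] = _mm_add_epi32(base, _mm_set1_epi32(4));
}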

The code uses pclmulqdq instructions like this:

movdqa %xmm15, %xmm1               # faster
pclmulqdq $0x00, (%rbp), %xmm1
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
movdqa %xmm14, %xmm3
pclmulqdq $0x00, 16(%rbp), %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11

vpclmulqdq $0x00, (%rbp), %xmm15, %xmm1   # slower
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
vpclmulqdq $0x00, 16(%rbp), %xmm14, %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11

Another question: when my data is unaligned, how can I write code like pxor (%rdi), %xmm0? (Editor's note: removed from the title because it's a separate question, and because there's no better answer than aligning the pointers for the main part of the loop.)

My data has 16-bit (2-byte) alignment, but sometimes I need to load the data and then perform an xor operation on it, so I can't write code like this:

pxor (%rdi), %xmm8     # would segfault from misaligned %rdi
pxor 16(%rdi), %xmm9
pxor 32(%rdi), %xmm10
pxor 48(%rdi), %xmm11

I changed my code so that it is now correct, but I think it may not be very efficient. What should I do?

movdqu (%rdi), %xmm0
movdqu 16(%rdi), %xmm13
movdqu 32(%rdi), %xmm14
movdqu 48(%rdi), %xmm15

pxor %xmm0, %xmm8
pxor %xmm13, %xmm9
pxor %xmm14, %xmm10
pxor %xmm15, %xmm11
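
(One possible fix, following the suggestion in the comments below: peel off the misaligned head of the buffer with scalar code, then run the main loop on a 16-byte-aligned pointer so loads can fold into the ALU instructions again. A minimal sketch in C intrinsics; the function name, the constant-mask operation, and the loop structure are illustrative assumptions, not the real code:)

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: xor a constant mask over a buffer that is only
 * 2-byte aligned.  After the scalar peel, p + i is 16-byte aligned,
 * so the compiler can use aligned loads that fold into pxor instead
 * of needing a separate movdqu per vector. */
void xor_mask(uint8_t *p, size_t n, uint8_t mask)
{
    size_t i = 0;

    /* scalar peel until p + i is 16-byte aligned */
    while (i < n && (((uintptr_t)(p + i)) & 15)) {
        p[i] ^= mask;
        i++;
    }

    /* aligned main loop: _mm_load_si128 is safe here */
    const __m128i vmask = _mm_set1_epi8((char)mask);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_load_si128((__m128i *)(p + i));
        _mm_store_si128((__m128i *)(p + i), _mm_xor_si128(v, vmask));
    }

    /* scalar tail for the last 0..15 bytes */
    for (; i < n; i++)
        p[i] ^= mask;
}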
Comments:

  • What hardware are you running on? Different CPUs have different behaviour for micro-fusion of load+ALU. [`vpaddd` can micro-fuse a load as long as it uses a non-indexed addressing mode on Intel Sandybridge-family.](https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes#comment76198723_31027695) On Haswell and later, pclmul can't micro-fuse a load; it decodes as an extra uop. On IvB and earlier, it's micro-coded as 18 uops. – Peter Cordes Oct 25 '17 at 08:54
  • re: the second part: without AVX, there's not really any workaround for the front-end bottleneck of needing a separate `movdqu` with unaligned data. Can you do the first few iterations scalar (or with an unaligned vector) so the pointers are aligned inside your loop? Or can you align your data to 16 bytes instead of 16 bits (2 bytes)? – Peter Cordes Oct 25 '17 at 08:55
  • This is really 2 separate questions, which is frowned upon: [ask]. But for the second part, see https://stackoverflow.com/questions/34306933/vectorizing-with-unaligned-buffers-using-vmaskmovps-generating-a-mask-from-a-m for some discussion. Like I said, ideally do the first partial vector unaligned, then aligned over the main part, then unaligned for the end again. – Peter Cordes Oct 25 '17 at 09:02
  • If you're on an Intel CPU, maybe you're running into uop-cache issues with all those 2-uop `aes` instructions, and 2 separate instructions spread those uops over more code-size? IDK, unlikely. You'd need to investigate further with performance counters to see what kind of bottleneck you're having. – Peter Cordes Oct 25 '17 at 09:14
  • Thank you very much. I'm running on an Intel Haswell CPU. How do I know whether an instruction can micro-fuse? When I use the two instructions `movdqa %xmm0, %xmm8` and `paddd (ONE), %xmm8`, micro-fusion can't happen, but when I use `vpaddd (ONE), %xmm0, %xmm8`, the instruction can micro-fuse the load? Thank you! – Bai Oct 27 '17 at 01:09
  • `paddd (ONE), %xmm8` can micro-fuse as well, unless `ONE` expands to an indexed addressing mode. See my answer on https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes. So `movdqa` + `paddd` is 2 uops for the front-end, but `vpaddd` is only one. IDK why it would be slower on Haswell to write more compact code for pclmul; other than front-end effects, it should be break-even. How much slower? Ideally use performance counters to measure in clock cycles. – Peter Cordes Oct 27 '17 at 01:50
  • Please tell me how to use performance counters to measure in clock cycles. I always use oprofile to measure, and I can't find the rules about instructions and assembly code. Some of the experimental results seem very strange to me. Thank you. – Bai Oct 27 '17 at 06:24
  • See for example https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all/44193770#44193770, where my answer shows using `perf` to count cycles and uops to demonstrate mov-elimination. Also related: [IACA for static analysis](https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it). See other links in https://stackoverflow.com/tags/x86/info, especially [Agner Fog's instruction tables and microarch pdf](http://agner.org/optimize), but beware that there are some subtle things he doesn't mention for Haswell and later. – Peter Cordes Oct 27 '17 at 06:39 [see the timing sketch below]
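
(For the measurement question above: `perf stat` counting cycles and issued uops, as in the linked answer, is the right tool. As a rough fallback when perf isn't available, you can wrap the loop under test in rdtsc reads; this is a swapped-in alternative, not what the comments recommend. A minimal sketch, where the iteration count and loop body are placeholders:)

#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <stdio.h>

/* Hypothetical timing harness.  rdtsc counts reference cycles, not
 * core clock cycles, so results shift with turbo/frequency scaling;
 * use perf for real cycle and uop counts.  This gives only a rough
 * comparison between two versions of a loop. */
int main(void)
{
    unsigned aux;
    unsigned long long t0 = __rdtsc();
    for (volatile int i = 0; i < 100000000; i++) {
        /* ... loop body under test (paddd vs. vpaddd version) ... */
    }
    unsigned long long t1 = __rdtscp(&aux);  /* rdtscp waits for prior instructions to finish */
    printf("reference cycles: %llu\n", t1 - t0);
    return 0;
}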
