The vpclmulqdq instruction has four operands and pclmulqdq has three, so I thought vpclmulqdq could be used instead of movdqa + pclmulqdq, but in my experiments it turned out slower. However, when I use vpaddd instead of movdqa + paddd, I get a faster result, so I'm confused. The code uses paddd instructions like this:
movdqa %xmm0, %xmm8 # slower
movdqa %xmm0, %xmm9
movdqa %xmm0, %xmm10
movdqa %xmm0, %xmm11
paddd (ONE), %xmm8
paddd (TWO), %xmm9
paddd (THREE), %xmm10
paddd (FOUR), %xmm11

vpaddd (ONE), %xmm0, %xmm8 # faster
vpaddd (TWO), %xmm0, %xmm9
vpaddd (THREE), %xmm0, %xmm10
vpaddd (FOUR), %xmm0, %xmm11
The code uses pclmulqdq instructions like this:
movdqa %xmm15, %xmm1 # faster
pclmulqdq $0x00, (%rbp), %xmm1
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
movdqa %xmm14, %xmm3
pclmulqdq $0x00, 16(%rbp), %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11

vpclmulqdq $0x00, (%rbp), %xmm15, %xmm1 # slower
aesenc 16(%r15), %xmm8
aesenc 16(%r15), %xmm9
aesenc 16(%r15), %xmm10
aesenc 16(%r15), %xmm11
vpclmulqdq $0x00, 16(%rbp), %xmm14, %xmm3
aesenc 32(%r15), %xmm8
aesenc 32(%r15), %xmm9
aesenc 32(%r15), %xmm10
aesenc 32(%r15), %xmm11
Other question: when my data is unaligned, how do I write code like pxor (%rdi), %xmm0? (Editor's note: removed from the title because it's a separate question, and because there's no better answer than aligning pointers for the main part of the loop.)
My data has only 16-bit (2-byte) alignment, but sometimes I need to load the data and then XOR it, so I can't write code like this:
pxor (%rdi), %xmm8 # would segfault from misaligned %rdi
pxor 16(%rdi), %xmm9
pxor 32(%rdi), %xmm10
pxor 48(%rdi), %xmm11
I changed my code so it is now correct, but I suspect it is not very efficient. What should I do?
movdqu (%rdi), %xmm0 # unaligned loads into temporary registers
movdqu 16(%rdi), %xmm13
movdqu 32(%rdi), %xmm14
movdqu 48(%rdi), %xmm15
pxor %xmm0, %xmm8 # then XOR register-to-register
pxor %xmm13, %xmm9
pxor %xmm14, %xmm10
pxor %xmm15, %xmm11
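
One idea I'm considering (a sketch only, assuming AVX is available, which it should be since the code already uses vpaddd and vpclmulqdq): fold the unaligned loads into VEX-encoded vpxor, since VEX memory operands don't require 16-byte alignment the way legacy-SSE pxor does:

vpxor (%rdi), %xmm8, %xmm8 # VEX memory operand, no alignment requirement
vpxor 16(%rdi), %xmm9, %xmm9
vpxor 32(%rdi), %xmm10, %xmm10
vpxor 48(%rdi), %xmm11, %xmm11

This would replace four movdqu + four pxor with four instructions, but I don't know whether it is actually faster here.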