Is it possible to use ymm16 - ymm31 for AVX2 vpcmpeq{size} instructions?

Question

I am wondering if it is possible to do something along the lines of:

vpcmpeqb %ymm16, %ymm17, %ymm16

Trying to do this and compiling with gcc, I get:

Assembler messages: Error: unsupported instruction `vpcmpeqb'

AFAICT this is impossible: felixcloutier says that the only EVEX-prefixed forms of cmpeq have a mask destination. But possibly there is something I am missing, or a way to do this directly with the byte encoding.

Thanks!

score 6 · Accepted Answer · answered Mar 31 '21 at 19:58

X / YMM16..31 require an EVEX prefix to access at all.

You can't use them with AVX1 / AVX2 forms of instructions.
So no, either compare only into mask regs, or use ymm0..15.

A VEX prefix + ModRM only has a total of 4 bits per register operand, so there'd be no way for the AVX1/2 encoding to use a register number that needs 5 bits.
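
If the goal is an AVX2-style all-ones / all-zeros vector result in a high register, one possible workaround is to do the EVEX compare into a mask register and then expand the mask back into a vector. This is only a sketch, assuming AVX-512BW and AVX-512VL are available; the choice of `%k1` is arbitrary and not from the question.

```asm
# AT&T syntax (GAS). Assumes AVX-512BW + AVX-512VL; %k1 is an arbitrary mask register.
vpcmpeqb  %ymm16, %ymm17, %k1    # EVEX compare: bit i of k1 is set where byte i of ymm17 == byte i of ymm16
vpmovm2b  %k1, %ymm16            # expand the mask: byte i of ymm16 = 0xFF if bit i of k1 is set, else 0x00
```

This costs an extra instruction compared to the AVX2 form, so when the compare result is only needed for a branch or as a merge mask, it is probably better to use the mask register directly rather than expanding it.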

GAS's error message is unhelpful. Perhaps it decides that it's the EVEX form based on the use of AVX-512-only registers, then notices that it's the wrong set of operands.

NASM says "invalid combination of opcode and operands" which is not very specific either, but at least correct.

clang's built-in assembler is probably the best:

foo.s:1:26: error: invalid operand for instruction
vpcmpeqb %ymm16, %ymm17, %ymm16
                         ^~~~~~

answered Mar 31 '21 at 19:58

Peter Cordes

Bummer, so basically with `ymm16`...`ymm31` you are forced into 3c latency p5 bottleneck on cmp instructions. Annoying design. – Noah Mar 31 '21 at 20:04
@Noah: Oh, yeah :/ If that's a problem, arrange for your data to be in YMM0..15 for those instructions, even if that means you need a VZEROUPPER when you're done. You can freely mix AVX2 and AVX-512, for example counting matches with AVX2 `vpcmpeqb (%rdi), %ymm0, %ymm1` / AVX-512 `vpsubb %ymm1, %ymm30, %ymm30`. So if overall register-pressure is a problem, you can still use AVX2 stuff on half the total YMM regs. – Peter Cordes Mar 31 '21 at 20:09
In general yeah. Unfortunately not always an easy option. `vzeroupper` aborts HLE transactions. So if you want to use `ymm0`...`ymm15` in an HLE transaction, you need to explicitly `xor`-zero the registers you use. – Noah Mar 31 '21 at 20:22
@Noah: vpxor-zeroing the regs you used is *not* equivalent to vzeroupper for avoiding [later problems with SSE instructions](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852). I forget if there's a Q&A explicitly confirming that, but [this](https://stackoverflow.com/q/49019614/224132) is related. Maybe you can `vzeroupper` after the transaction, perhaps with `_mm_zeroupper()` in an outer function, if you're calling a hand-written asm function as part of a transaction so can't put it there. – Peter Cordes Mar 31 '21 at 20:29
Oof, you saved me big time! That too is incredibly annoying design! – Noah Mar 31 '21 at 20:35
