Are PINSRB and PEXTRB faster or slower than MOV?

Question

I want to store an array of byte integers either in memory or in an xmm register. To access each byte of that array in memory, I would use:

lea rdi,[memory_array]
mov al,[rdi]
mov [rdi],al

To access each byte of that array in an xmm register, I would use:

pextrb al,xmm0,0 (or pextrb al,xmm0,1, etc).  
pinsrb xmm0,al,0 (or pinsrb xmm0,al,1, etc).

According to Agner Fog's instruction tables for Skylake:

MOV (to al) has 1 uop fused, 2 uops unfused, and p23 p0156 uops each port, no latency and 0.5 reciprocal throughput.  

PINSRB has 2 uops fused, 2 uops unfused, and 2p5 uops each port, 3 latency and 2 reciprocal throughput.  

PEXTRB has 2 uops fused, 2 uops unfused, p0 p5 uops each port, 3 latency and 1 reciprocal throughput.

On its face it looks like PINSRB and PEXTRB are slower than MOV, but I'm not sure I'm reading the tables right. I thought register-to-register operations were generally faster than memory moves. Based on the stats above, is my conclusion correct that moves between xmm and GP registers are slower than memory moves?

`mov al,[rdi]` - You normally don't want to merge with the old value of RAX; that creates a false dependency. Agner Fog's insn tables don't have sensible numbers for load latency, only the round trip. Also, **an empty latency field in Agner's tables doesn't mean 0 latency, it means he didn't measure it!!** There's L1d load-use latency (or store-forwarding latency), and the [ALU merge latency](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake) because you wrote a byte register instead of using `movzx` like a compiler would. — Peter Cordes, May 11 '20 at 18:59
Also, `pinsrb` and `pextrb` take dword GP register operands like EAX, not AL. And yes they're slow for throughput; if you want a lot of separate bytes, vector store / movzx reload. Not totally clear what you're comparing, though; a buffer in memory vs. an XMM register are not usually equivalent for doing other things with them. — Peter Cordes, May 11 '20 at 19:00
As it's possible to loop through an xmm register with the PINSR/PEXTR instructions, that would work to store an array in an xmm register, but as I suspected (and I think you have confirmed) it looks like a memory array would simply be faster than using PINSR/PEXTR to access elements of an array. The idea is faster access, but it looks like that's not what will happen. — RTC222, May 11 '20 at 19:13
It's not possible to loop with pinsr/pextr instructions. The index has to be an immediate, not a loop counter. If you need to do something scalar to all 16 / 32 / 64 byte elements of a vector reg, it's usually faster to use memory and eat the store-forwarding stall for the final vector reload. But if you just need one or two elements, `pextr` and/or `pinsr` are good. — Peter Cordes, May 11 '20 at 19:15
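
The "vector store / movzx reload" approach from the comments can be sketched in C with SSE2 intrinsics, used here as a stand-in for the assembly. This is only an illustration under assumptions: the helper names `get_byte` and `increment_each_byte` are hypothetical, the per-byte operation (adding 1) is a placeholder for whatever scalar work is needed, and an optimizing compiler may well turn the loop back into vector instructions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

/* Hypothetical helper: read byte `idx` of a vector when `idx` is only known
   at run time. pextrb needs an immediate index, so the whole vector is
   stored to a stack buffer and one byte is loaded back. */
static uint8_t get_byte(__m128i v, unsigned idx)
{
    uint8_t buf[16];
    assert(idx < 16);
    _mm_storeu_si128((__m128i *)buf, v);  /* one 16-byte vector store */
    return buf[idx];                      /* scalar byte load, typically a movzx */
}

/* Hypothetical helper: do scalar work on every byte of a vector by going
   through memory instead of issuing 16 pextrb/pinsrb pairs. */
static __m128i increment_each_byte(__m128i v)
{
    uint8_t buf[16];
    _mm_storeu_si128((__m128i *)buf, v);
    for (unsigned i = 0; i < 16; i++)
        buf[i] = (uint8_t)(buf[i] + 1);   /* placeholder op, wraps modulo 256 */
    /* One wide reload after 16 narrow stores: this is the reload that can
       incur the store-forwarding stall mentioned in the comments. */
    return _mm_loadu_si128((const __m128i *)buf);
}
```

When only one or two elements at fixed positions are needed, `pextrb`/`pinsrb` with an immediate index remain the simpler choice, as the last comment notes.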

0 Answers