I want to store an array of byte integers either in memory or in an xmm register. To access each byte of that array in memory, I would use:
lea rdi,[memory_array]   ; rdi = address of the array
mov al,[rdi]             ; read a byte
mov [rdi],al             ; write a byte
To access each byte of that array in an xmm register, I would use:
pextrb al,xmm0,0 (or pextrb al,xmm0,1, etc).
pinsrb xmm0,al,0 (or pinsrb xmm0,al,1, etc).
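For concreteness, here is a minimal sketch of the two variants side by side. It assumes NASM syntax, x86-64 with SSE4.1, and the System V calling convention. The label byte_access_demo, the byte index 3, and the inc that stands in for "do something with the byte" are placeholders I made up for illustration. I wrote the register operand of pextrb/pinsrb as eax because the Intel manual lists the register form with a 32-bit register; only the low byte is used.

```nasm
; Sketch only: the same read-modify-write of byte 3, done both ways.
default rel

section .data
memory_array: times 16 db 0        ; 16-byte array kept in memory

section .text
global byte_access_demo
byte_access_demo:
    ; Variant 1: the array lives in memory
    lea     rdi, [memory_array]    ; rdi = base address of the array
    mov     al, [rdi + 3]          ; load byte 3
    inc     al                     ; placeholder work on the byte
    mov     [rdi + 3], al          ; store byte 3

    ; Variant 2: the array lives in xmm0
    movdqu  xmm0, [rdi]            ; load all 16 bytes into xmm0 once
    pextrb  eax, xmm0, 3           ; eax = zero-extended byte 3
    inc     al                     ; placeholder work on the byte
    pinsrb  xmm0, eax, 3           ; write the low byte of eax into byte 3
    ret
```

Note that the byte index in pextrb/pinsrb is an immediate, so it has to be known at assembly time, whereas the memory variant can use a run-time index in the addressing mode.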
According to Agner Fog's instruction tables for Skylake:
MOV (load into al): 1 fused-domain uop, 2 unfused-domain uops, ports p23 p0156, no latency listed, reciprocal throughput 0.5.
PINSRB: 2 fused-domain uops, 2 unfused-domain uops, ports 2p5, latency 3, reciprocal throughput 2.
PEXTRB: 2 fused-domain uops, 2 unfused-domain uops, ports p0 p5, latency 3, reciprocal throughput 1.
On the face of it, PINSRB and PEXTRB look slower than MOV, but I'm not sure I'm reading the tables correctly. I thought register-to-register operations were generally faster than memory accesses. Based on the numbers above, is my conclusion correct that moves between an xmm register and a general-purpose register are slower than moves to and from memory?