
I want to store a byte integer array either in memory or in an xmm register. To access each byte of that array in memory, I would use:

lea rdi,[memory_array]
mov al,[rdi]
mov [rdi],al

To access each byte of that array in an xmm register, I would use:

pextrb al,xmm0,0 (or pextrb al,xmm0,1, etc).  
pinsrb xmm0,al,0 (or pinsrb xmm0,al,1, etc). 

According to Agner Fog's instruction tables for Skylake:

MOV (to al) has 1 uop fused, 2 uops unfused, and p23 p0156 uops each port, no latency and 0.5 reciprocal throughput.  

PINSRB has 2 uops fused, 2 uops unfused, and 2p5 uops each port, 3 latency and 2 reciprocal throughput.  

PEXTRB has 2 uops fused, 2 uops unfused, p0 p5 uops each port, 3 latency and 1 reciprocal throughput.  

On its face it looks like PINSRB and PEXTRB are slower than MOV, but I'm not sure I'm reading it right. I thought register-to-register operations were generally faster than memory moves. Is my conclusion that moves between xmm and GP registers are slower than memory moves correct, based on the stats above?

  • `mov al,[rdi]` - You normally don't want to merge with the old value of RAX; that creates a false dependency. Agner Fog's insn tables don't have sensible numbers for load latency, only the store/reload round trip. Also, **an empty latency field in Agner's tables doesn't mean 0 latency, it means he didn't measure it!!** There's L1d load-use latency (or store-forwarding latency), and the [ALU merge latency](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake) because you wrote a byte register instead of using `movzx` like a compiler would. – Peter Cordes May 11 '20 at 18:59
  • Also, `pinsrb` and `pextrb` take dword GP register operands like EAX, not AL. And yes, they're slow for throughput; if you want a lot of separate bytes, use a vector store and `movzx` reloads. It's not totally clear what you're comparing, though; a buffer in memory and an XMM register are not usually equivalent for doing other things with them. – Peter Cordes May 11 '20 at 19:00
  • As it's possible to loop through an xmm register with the PINSR/PEXTR instructions, that would work to store an array in an xmm register, but as I suspected (and I think you have confirmed) it looks like a memory array would simply be faster than using PINSR/PEXTR to access elements of an array. The idea is faster access, but it looks like that's not what will happen. – RTC222 May 11 '20 at 19:13
  • It's not possible to loop with `pinsr`/`pextr` instructions. The index has to be an immediate, not a loop counter. If you need to do something scalar to all 16 / 32 / 64 byte elements of a vector reg, it's usually faster to use memory and eat the store-forwarding stall for the final vector reload. But if you just need one or two elements, `pextr` and/or `pinsr` are good. – Peter Cordes May 11 '20 at 19:15
  • That answers my question. Thanks for the comments. – RTC222 May 11 '20 at 19:18
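
For later readers, here is a minimal sketch of the store/reload approach suggested in the comments. It is written in C with SSE2 intrinsics rather than assembly, and it is an illustration, not code from the thread. The function name `map_bytes_via_memory` and the lookup table `lut` are hypothetical, chosen only to give the loop some per-byte scalar work. The vector is spilled to a small buffer with one store, each byte is indexed with an ordinary loop counter (which the immediate-only index of `pextrb`/`pinsrb` does not allow), and the result is reloaded with one vector load.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical helper: replace each of the 16 bytes in v with lut[byte].
   The bytes are processed through memory instead of through
   16 pextrb/pinsrb pairs with hard-coded immediate indices. */
__m128i map_bytes_via_memory(__m128i v, const uint8_t lut[256])
{
    uint8_t buf[16];

    /* One 16-byte vector store spills the whole register. */
    _mm_storeu_si128((__m128i *)buf, v);

    /* In memory, the element index can be a run-time loop counter. */
    for (int i = 0; i < 16; i++) {
        buf[i] = lut[buf[i]];   /* byte load, table lookup, byte store */
    }

    /* One vector reload. Because it reads 16 separate byte stores, it is
       expected to pay the store-forwarding stall mentioned above. */
    return _mm_loadu_si128((const __m128i *)buf);
}
```

How the loop is compiled depends on the compiler and its options, so treat this as a description of the data flow rather than of the exact instruction sequence. If only one or two elements are needed, the comments above still recommend `pextr`/`pinsr` with a fixed index.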
