
I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling.

I profiled it with IACA to see if it was worth doing an AVX2 routine for, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB and HSW measured with IACA 2.1, SKL with 3.0; HSW gives the same number with 3.0.)

From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5.

I googled a bit, but couldn't find an explanation. Is Haswell really slower with legacy SSE, or did I just hit some extreme corner case? Any suggestions to dodge this bullet, other than AVX2? (AVX2 is a known option, but it requires updating the toolchain to a new version, so it is postponed for now.)

General remarks or suggested improvements are also welcome.

   // r8 and r9 are #bytes to go to the next line in resp. src and dest 
   // r12=3*r8 r13=3*r9  
  // load 8x8 bytes into 4 registers, bytes interleaved.
  movq xmm1,[rcx]
  movq xmm4,[rcx+2*r8]
  PUNPCKLBW xmm1,xmm4   // 0 2 0 2 0 2
  movq xmm7,[rcx+r8]
  movq xmm6,[rcx+r12]
  PUNPCKLBW xmm7,xmm6   // 1 3 1 3 1 3

  movdqa xmm2,xmm1
  punpcklbw xmm1,xmm7   // 0 1 2 3 0 1 2 3 in xmm1:xmm2
  punpckhbw xmm2,xmm7   
  lea rcx,[rcx+4*r8]

  // same for 4..7

  movq xmm3,[rcx]
  movq xmm5,[rcx+2*r8]
  PUNPCKLBW xmm3,xmm5
  movq xmm7,[rcx+r8]
  movq xmm8,[rcx+r12]
  PUNPCKLBW xmm7,xmm8

  movdqa xmm4,xmm3
  punpcklbw xmm3,xmm7
  punpckhbw xmm4,xmm7

  // now we join one "low" dword from XMM1:xmm2 with one "high" dword
  // from XMM3:xmm4

  movdqa  xmm5,xmm1
  pextrd  eax,xmm3,0
  pinsrd  xmm5,eax,1
  movq    [rdx],xmm5

  movdqa  xmm5,xmm3
  pextrd  eax,xmm1,1
  pinsrd  xmm5,eax,0
  movq    [rdx+r9],xmm5

  movdqa  xmm5,xmm1
  pextrd  eax,xmm3,2
  pinsrd  xmm5,eax,3
  MOVHLPS  xmm6,xmm5
  movq    [rdx+2*r9],xmm6

  movdqa  xmm5,xmm3
  pextrd  eax,xmm1,3
  pinsrd  xmm5,eax,2
  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

  lea     rdx,[rdx+4*r9]

  movdqa  xmm5,xmm2
  pextrd  eax,xmm4,0
  pinsrd  xmm5,eax,1
  movq    [rdx],xmm5

  movdqa  xmm5,xmm4
  pextrd  eax,xmm2,1
  pinsrd  xmm5,eax,0
  movq    [rdx+r9],xmm5

  movdqa  xmm5,xmm2
  pextrd  eax,xmm4,2
  pinsrd  xmm5,eax,3
  MOVHLPS  xmm6,xmm5
  movq    [rdx+2*r9],xmm6

  movdqa  xmm5,xmm4
  pextrd  eax,xmm2,3
  pinsrd  xmm5,eax,2
  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

  lea     rdx,[rdx+4*r9]

Purpose: it is really rotating images from a camera for vision-processing purposes. In some (heavier) apps the rotation is postponed and done display-only (OpenGL); in some it is easier to rotate the input than to adapt the algorithms.

Updated code: I posted some final code here. The speedup was very dependent on the size of the input: large on small images, but still a factor of two on larger ones compared to loop-tiling HLL code with a 32x32 tile (same algorithm as the linked asm code).

Marco van de Voort
  • It's not really about legacy SSE, as IACA said it's due to Haswell removing a shuffle unit so they can only go to p5. – harold Nov 24 '17 at 18:18
  • Ok, searching for "shuffle unit" yields some results, like https://www.realworldtech.com/haswell-cpu/4/. Seems it can't be helped without going 256-bit, i.e. AVX2. That said, Aki shaved off some cycles. – Marco van de Voort Nov 24 '17 at 20:12
  • Duplicate of https://stackoverflow.com/a/42222490/1716339 ? – Aki Suihkonen Nov 24 '17 at 20:51
  • Related but not a true duplicate. Just count the number of loads. Source and destination format is not a degree of freedom in image manipulation, so e.g. gather operations are not usable (at all). Also, it uses VEX code. I will still keep it on file for the AVX2 version though, so thanks anyway. – Marco van de Voort Nov 24 '17 at 21:32
  • See also [Packing two DWORDs into a QWORD to save store bandwidth](https://stackoverflow.com/questions/47242353/packing-two-dwords-into-a-qword-to-save-store-bandwidth). But you're doing some `punpcklbw`, so you can't avoid SSE2 entirely and just use integer `shld` or something. But note that Haswell has 2 load ports but only 1 shuffle port. – Peter Cordes Nov 25 '17 at 00:21

2 Answers


Inserting a dword is more (if not most) efficiently done with a combination of pshufd and an immediate blend.

 pshufd  xmm5, xmm3, 0x55 * slot           ; broadcast dword 'slot' of xmm3 to all slots
 pblendw xmm1, xmm5, 3 << (2 * dst_slot)   ; overwrite the two words of dword 'dst_slot'

pblendw is SSE4.1, but is of course available on Haswell. Unfortunately it only runs on port 5 on Haswell/Skylake, so it still competes with shuffles.
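
For example, a sketch of the question's first dword-merge block rewritten this way (slot = 0, dst_slot = 1; xmm6 as a scratch register is an illustrative choice, not from the original):

 movdqa  xmm5, xmm1          ; keep xmm1 intact for the later rows
 pshufd  xmm6, xmm3, 0x00    ; broadcast dword 0 of xmm3 (0x55 * 0)
 pblendw xmm5, xmm6, 0x0C    ; 3 << 2: overwrite words 2..3 = dword slot 1
 movq    [rdx], xmm5

On Haswell this is still two port-5 uops (pblendw also needs p5), but it is fewer total uops than the pextrd + pinsrd pair it replaces.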

AVX2 vpblendd runs on any vector-ALU port (p0/p1/p5) on Haswell/Skylake, so is much more efficient than word-granularity pblendw / vpblendw.

If you need to avoid AVX2, consider using SSE4.1 blendps to blend 32-bit elements with an immediate control. It runs on any port on Haswell (or p0/p5 on Sandybridge vs. p1/p5 for shuffles), and the latency penalty for using it on integer data should be irrelevant for your case.
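
A sketch of the same merge with blendps (same illustrative registers as above; the domain-crossing latency is the penalty referred to in the previous paragraph):

 movdqa  xmm5, xmm1
 pshufd  xmm6, xmm3, 0x00    ; still one p5 shuffle on Haswell
 blendps xmm5, xmm6, 0x2     ; dword-granularity immediate: just bit 1 for slot 1
 movq    [rdx], xmm5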

Peter Cordes
Aki Suihkonen
  • While better (HSW/SKL=28, IVB=14), the factor 2 remains. Probably because of the fewer shuffle units, as harold mentioned. – Marco van de Voort Nov 24 '17 at 20:11
  • @MarcovandeVoort: with AVX2, use `vpshufd` / `vpblendd`. `vpblendd` can run on any port on Haswell, but `pblendw` can still only run on port 5. See http://agner.org/optimize/ for Agner's microarch guide and instruction tables. – Peter Cordes Nov 25 '17 at 00:17
  • Also, you might be able to use an unaligned load instead of a shuffle, to put the data you want at the position you want inside a vector. (As a setup for a blend) – Peter Cordes Nov 25 '17 at 00:18

TL:DR: use punpckl/hdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?

Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.


The code is almost twice as slow on Haswell/Skylake as on IVB

Your code bottlenecks heavily on shuffle throughput.

Haswell only has one shuffle execution unit, on port 5. SnB/IvB have 2 integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide.

I see you already found David Kanter's excellent Haswell microarch write-up.

It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.


You have some simple missed-optimizations, even before we think about major changes:

  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

should be movhps [rdx+r13],xmm5. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.

pextrd eax,xmm3,0 is always worse than movd eax,xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win: you could do a 64-bit movq and then overwrite half of that with a 32-bit pextrd. You might then bottleneck on store throughput. Also, on Sandybridge, indexed addressing modes don't stay micro-fused, so more stores would hurt your total uop throughput. Haswell doesn't have that problem for stores, only for some indexed loads depending on the instruction.) If you use more stores in some places and more shuffles in others, save the extra stores for the places with single-register addressing modes.
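
A rough sketch of the movq + pextrd-to-memory idea, applied to the question's second output row (the store to [rdx+r9]); whether it wins depends on whether you end up store-bound or shuffle-bound:

   movq    [rdx+r9], xmm3      ; stores dwords 0 and 1 of xmm3
   pextrd  [rdx+r9], xmm1, 1   ; overwrite the first dword with dword 1 of xmm1
                               ; (still a shuffle + store, but no trip through eax and no pinsrd)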


Source and destination format is not a degree of freedom in image manipulation.

Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant.

This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
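
For instance, a sketch (not code from x264; rsi as the pointer to the copied block is an assumption):

   ; 8x8 block copied into a contiguous 64-byte buffer, so the row stride is 8
   movdqa  xmm1, [rsi]      ; rows 0 and 1 in one aligned load
   movdqa  xmm2, [rsi+16]   ; rows 2 and 3
   movdqa  xmm3, [rsi+32]   ; rows 4 and 5
   movdqa  xmm4, [rsi+48]   ; rows 6 and 7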

Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.


  // now we join one "low" dword from XMM1:xmm2 with one "high" dword
  // from XMM3:xmm4

Extract/insert to/from integer registers is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.

Also consider using SSSE3 pshufb, which can put your data anywhere it needs to be and zero other elements. This can set you up for merging with por. (You might use pshufb instead of punpcklbw.)
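
A minimal sketch of that mechanism, written NASM-style (the shuf_even / shuf_odd masks are illustrative, not from the question; control bytes with the high bit set make pshufb write zero, and the constants would live in .rodata):

   shuf_even: db 0,0x80,1,0x80,2,0x80,3,0x80,4,0x80,5,0x80,6,0x80,7,0x80
   shuf_odd:  db 0x80,0,0x80,1,0x80,2,0x80,3,0x80,4,0x80,5,0x80,6,0x80,7

   movq    xmm1, [rcx]             ; row 0
   movq    xmm4, [rcx+2*r8]        ; row 2
   pshufb  xmm1, [rel shuf_even]   ; row-0 bytes to even byte positions, rest zeroed
   pshufb  xmm4, [rel shuf_odd]    ; row-2 bytes to odd byte positions, rest zeroed
   por     xmm1, xmm4              ; same interleave punpcklbw gives; the payoff only
                                   ; comes from folding more rearrangement into the masks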


Another option is to use shufps to combine data from two sources. You might need another shuffle after that. Or use punpckldq.
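
For reference, the shufps route looks like this (illustrative only; it still costs two shuffles, which is why the punpckldq version below is the one to use). With xmm1 = [ a b c d ] and xmm3 = [ i j k l ] as in the diagram below:

   movaps  xmm5, xmm1
   shufps  xmm5, xmm3, 0x44     ; xmm5 = [ a b i j ]
   pshufd  xmm5, xmm5, 0xD8     ; xmm5 = [ a i b j ]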

// "low" dwords from XMM1:xmm2
//  high dwords from XMM3:xmm4

;  xmm1:  [ a b c d ]   xmm2: [ e f g h ]
;  xmm3:  [ i j k l ]   xmm4: [ m n o p ]

; want: [ a i b j ] / [ c k d l ] / ... I think.

;; original: replace these with
;  movdqa  xmm5,xmm1     ; xmm5 = [ a b c d ]
;  pextrd  eax,xmm3,0    ; eax = i
;  pinsrd  xmm5,eax,1    ; xmm5 = [ a i ... ]
;  movq    [rdx],xmm5

;  movdqa  xmm5,xmm3       ; xmm5 = [ i j k l ]
;  pextrd  eax,xmm1,1      ; eax = b
;  pinsrd  xmm5,eax,0      ; xmm5 = [ b j ... ]
;  movq    [rdx+r9],xmm5

Replace with this:

   movdqa    xmm5, xmm1
   punpckldq xmm5, xmm3     ; xmm5 = [ a i b j ]
   movq     [rdx], xmm5
   movhps   [rdx+r9], xmm5  ; still a pure store, doesn't cost a shuffle

So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).

Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well.

 ;  movdqa  xmm5,xmm1       ; xmm5 = [ a b c d ]
 ; pextrd  eax,xmm3,2      ; eax = k
 ; pinsrd  xmm5,eax,3      ; xmm5 = [ a b c k ]
 ; MOVHLPS  xmm6,xmm5      ; xmm6 = [ c k ? ? ]  (false dependency on old xmm6)
 ; movq   [rdx+2*r9],xmm6
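
A sketch of the replacement, covering both the [ c k ] and [ d l ] rows just like the punpckldq block above:

   movdqa    xmm6, xmm1
   punpckhdq xmm6, xmm3          ; xmm6 = [ c k d l ]
   movq     [rdx+2*r9], xmm6     ; [ c k ]
   movhps   [rdx+r13], xmm6      ; [ d l ], again a pure store, no movhlps needed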

Then unpack lo/hi for xmm2 and xmm4.

Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
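
For example (128-bit VEX encoding, so AVX1 is enough for this):

   vpunpckldq xmm5, xmm1, xmm3   ; xmm5 = [ a i b j ], xmm1 left intact
   vmovq      [rdx], xmm5
   vmovhps    [rdx+r9], xmm5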

Peter Cordes
  • Thanks for the micro opts, will study them. Btw, the redundant moves that AVX2 saves are 0 cycle instructions according to IACA. Thanks too for the reference. That one does shuf first, punpck later, and I do punpck first. Will test if the different approach cuts corners in my case. – Marco van de Voort Nov 25 '17 at 12:59
  • @MarcovandeVoort: yes, but they consume front-end bandwidth. You probably bottleneck on throughput, not latency. See https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all for more about zero-latency MOV not being the same thing as "free". And BTW, you only need AVX1 to use 128-bit integer versions of SSE instructions. If your data in memory only comes in 8-byte contiguous chunks, that's probably what you should use. – Peter Cordes Nov 25 '17 at 13:00
  • @MarcovandeVoort: And yeah, using `pshufb` instead of `punpcklbw` / wb to feed `punpckl/hdq` may be good, but in your case with data coming in 8-byte chunks to start with, you may have less to gain. Definitely worth a second look at that other question now that you know that `movhps` is a good option for split stores of an XMM vector. – Peter Cordes Nov 25 '17 at 13:03
  • I've been testing that for the last 5 minutes, using 4 times your proposed block (lo and hi unpack for xmm1:xmm2 and xmm3:xmm4). New results: IVB=10.75, HSW/SKL=12.0 cycles. Unbelievable, I had to check twice! Both faster and the difference nearly gone. – Marco van de Voort Nov 25 '17 at 13:08
  • IACA says the bottleneck is backend throughput, but AFAIK it doesn't support latency analysis anymore. Will study the link anyway to get a better feel for it. – Marco van de Voort Nov 25 '17 at 13:11
  • @MarcovandeVoort: only twice as fast on IvB? Oh well, I guess my version isn't bottlenecked on shuffles anymore on IvB :P. 3x speedup on HSW sounds about right, still bottlenecked on shuffles it seems (not surprising, shuffling + store/load is all it's doing and the single shuffle unit really hurts HSW for stuff like this). – Peter Cordes Nov 25 '17 at 13:11
  • @MarcovandeVoort: new IACA has a "pipeline trace" mode instead of latency mode. You can use that to see the effect of latency on a loop. Or of course if a loop-carried dependency chain is a throughput bottleneck, it will show up in the throughput output. – Peter Cordes Nov 25 '17 at 13:14