Intel broadwell uop fusion for AVX load/store instructions

Question

I'm trying to identify a performance baseline for memory-bound vectorized loops. I'm doing this on an Intel Broadwell chip with AVX2 instructions in a 32byte aligned environment.

A baseline loop uses 8 YMM registers at a time to load from one location and nontemporally store to another:

%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC

align 32                ;; avx2 vector alignement

global _ls_01_opt

section .text

_ls_01_opt:             ;rdi is input, rsi output
  push rbp
  mov rbp,rsp

  xor rax,rax

  mov ebx, 111          ; IACA PREFIX
  db 0x64, 0x67, 0x90   ; 

  LOOP0:
    vmovapd ymm0, ymmword ptr [  (32) + rdi +8*rax]
    vmovapd ymm2, ymmword ptr [  (64) + rdi +8*rax]
    vmovapd ymm4, ymmword ptr [  (96) + rdi +8*rax]
    vmovapd ymm6, ymmword ptr [  (128) + rdi +8*rax]

    vmovapd ymm8, ymmword ptr  [  (160) + rdi +8*rax]
    vmovapd ymm10, ymmword ptr [  (192) + rdi +8*rax]
    vmovapd ymm12, ymmword ptr [  (224) + rdi +8*rax]
    vmovapd ymm14, ymmword ptr [  (256) + rdi +8*rax]

    vmovntpd ymmword ptr [  (32) + rsi +8*rax], ymm0
    vmovntpd ymmword ptr [  (64) + rsi +8*rax], ymm2
    vmovntpd ymmword ptr [  (96) + rsi +8*rax], ymm4
    vmovntpd ymmword ptr [  (128) + rsi +8*rax], ymm6

    vmovntpd ymmword ptr [  (160) + rsi +8*rax], ymm8
    vmovntpd ymmword ptr [  (192) + rsi +8*rax], ymm10
    vmovntpd ymmword ptr [  (224) + rsi +8*rax], ymm12
    vmovntpd ymmword ptr [  (256) + rsi +8*rax], ymm14

  add rax, (4*8)
  cmp rax, SIZE
  jne LOOP0


  mov ebx, 222          ; IACA SUFFIX
  db 0x64, 0x67, 0x90   ; 

  ret

I assemble it with YASM and then do a test with the Intel Architecture Code Analyzer (IACA), which tells me:

Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles       Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 8.0    4.0  | 8.0    4.0  | 8.0  | 0.5  | 0.5  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm0, ymmword ptr [rdi+rax*8+0x20]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     | CP | vmovapd ymm2, ymmword ptr [rdi+rax*8+0x40]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm4, ymmword ptr [rdi+rax*8+0x60]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     | CP | vmovapd ymm6, ymmword ptr [rdi+rax*8+0x80]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm8, ymmword ptr [rdi+rax*8+0xa0]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     | CP | vmovapd ymm10, ymmword ptr [rdi+rax*8+0xc0]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm12, ymmword ptr [rdi+rax*8+0xe0]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     | CP | vmovapd ymm14, ymmword ptr [rdi+rax*8+0x100]
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0x20], ymm0
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0x40], ymm2
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0x60], ymm4
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0x80], ymm6
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0xa0], ymm8
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0xc0], ymm10
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0xe0], ymm12
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovntpd ymmword ptr [rsi+rax*8+0x100], ymm14
|   1    |           | 0.5 |           |           |     | 0.5 |     |     |    | add rax, 0x20
|   1    | 0.5       |     |           |           |     |     | 0.5 |     |    | cmp rax, 0x8000000
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffff78

I was under the impression that I could get 2x loads at a time with broadwell with simultaneous loads on ports 2&3. Why isn't that happening?

Thanks

UPDATE

Per the suggestion below, pd's were replaced with ps's and the addresses were consolidated to a single register, the new code looks like:

%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC

align 32                ;; avx2 vector alignement

global _ls_01_opt

section .text

_ls_01_opt:             ;rdi is input, rsi output
  push rbp
  mov rbp,rsp

  xor rax,rax
  xor rbx,rbx
  xor rcx,rcx

  or rbx, rdi
  or rcx, rsi


  mov ebx, 111          ; IACA PREFIX
  db 0x64, 0x67, 0x90   ; 

  LOOP0:
    vmovaps ymm0, ymmword ptr  [   (32) + rbx ]
    vmovaps ymm2, ymmword ptr  [   (64) + rbx ]
    vmovaps ymm4, ymmword ptr  [   (96) + rbx ]
    vmovaps ymm6, ymmword ptr  [  (128) + rbx ]

    vmovaps ymm8, ymmword ptr  [  (160) + rbx ]
    vmovaps ymm10, ymmword ptr [  (192) + rbx ]
    vmovaps ymm12, ymmword ptr [  (224) + rbx ]
    vmovaps ymm14, ymmword ptr [  (256) + rbx ]

    vmovntps ymmword ptr [   (32) + rcx], ymm0
    vmovntps ymmword ptr [   (64) + rcx], ymm2
    vmovntps ymmword ptr [   (96) + rcx], ymm4
    vmovntps ymmword ptr [  (128) + rcx], ymm6

    vmovntps ymmword ptr [  (160) + rcx], ymm8
    vmovntps ymmword ptr [  (192) + rcx], ymm10
    vmovntps ymmword ptr [  (224) + rcx], ymm12
    vmovntps ymmword ptr [  (256) + rcx], ymm14

  add rax, (4*8)
  add rbx, (4*8*8)
  add rcx, (4*8*8)
  cmp rax, SIZE
  jne LOOP0


  mov ebx, 222          ; IACA SUFFIX
  db 0x64, 0x67, 0x90   ; 

  ret

Then IACA tells me:

Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles       Throughput Bottleneck: Port4

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 5.3    4.0  | 5.3    4.0  | 8.0  | 1.0  | 1.0  | 5.3  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovaps ymm0, ymmword ptr [rbx+0x20]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovaps ymm2, ymmword ptr [rbx+0x40]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovaps ymm4, ymmword ptr [rbx+0x60]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovaps ymm6, ymmword ptr [rbx+0x80]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovaps ymm8, ymmword ptr [rbx+0xa0]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovaps ymm10, ymmword ptr [rbx+0xc0]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovaps ymm12, ymmword ptr [rbx+0xe0]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovaps ymm14, ymmword ptr [rbx+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovntps ymmword ptr [rcx+0x20], ymm0
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovntps ymmword ptr [rcx+0x40], ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovntps ymmword ptr [rcx+0x60], ymm4
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovntps ymmword ptr [rcx+0x80], ymm6
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 | CP | vmovntps ymmword ptr [rcx+0xa0], ymm8
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 | CP | vmovntps ymmword ptr [rcx+0xc0], ymm10
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 | CP | vmovntps ymmword ptr [rcx+0xe0], ymm12
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 | CP | vmovntps ymmword ptr [rcx+0x100], ymm14
|   1    | 1.0       |     |           |           |     |     |     |     |    | add rax, 0x20
|   1    |           | 1.0 |           |           |     |     |     |     |    | add rbx, 0x100
|   1    |           |     |           |           |     | 1.0 |     |     |    | add rcx, 0x100
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp rax, 0x8000000
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffff7a

Which tells me that the stores can now use port 7 for addresses and the operations were stored. IACA tells me that the "Block throughput" is still 8 operations due to the extra operations to get the addresses onto a single register. Maybe I'm doing this wrong?

I still don't understand why the load operations cannot be fused

http://stackoverflow.com/q/25899395/2542702 – Z boson Mar 07 '16 at 08:27 — Z boson, Mar 07 '16 at 08:27

score 5 · Accepted Answer · edited May 23 '17 at 12:07

5

The store-AGU on port7 can only handle "simple" effective addresses, so your stores also need the AGU on the load ports. IACA does show your loads not actually competing with each other; it's the stores that are competing.

Note that there are only ~10 fill buffers per core for MOVNT stores, so those will very quickly fill up and be the bottleneck.

See also Micro fusion and addressing modes. Your stores could micro-fuse and take fewer fused-domain uops if you used a one-register addressing mode for them.

Also, I guess it doesn't matter for VEX-coded instructions, but the SSE pd versions take an extra byte of x86 machine code. clang tends to use movaps for loads/stores because it's shorter, even on integer vectors. Every existing CPU runs movaps / movapd the same. So I'd suggest just using vmovaps / vmovntps. It won't make any difference at all, though. Just one less set bit in the VEX prefix.

edited May 23 '17 at 12:07

Community

1
1

answered Mar 05 '16 at 21:45

Peter Cordes

328,167
45
605
847

Thanks for your response. I updated per your suggestions below and was able to get it to use port 7 for the store operations and was able to get some fusion. But I don't understand why the load ops will not fuse... maybe i'm understanding this incorrectly – Gavin Portwood Mar 05 '16 at 23:09
Pure load instructions can't and don't need to micro-fuse. Unfused stores *are* 2 fused-domain uops. Loads can micro-fuse into ALU uops, like `addps xmm0, [rsi]`, but `mov` loads are always 1uop. They make use of both the AGU and the load-data units on the same load execution port. IACA will never show two instructions on the same line, if that's what you're thinking. It's listing uops, not cycles. – Peter Cordes Mar 05 '16 at 23:19
@user2616417: also, you can keep using an indexed addressing mode for loads if you like. Your updated code has three loop counters, but only uses two of them. You should `cmp` one of the pointers against and end-address, and `jb`. You can use `rbx` = src - dest, and do loads from `[rbx + rcx + offset]`, and only need one add in the loop. (Since `mov`-loads don't need to micro-fuse, there's no problem using 2-register addressing modes.) See http://stackoverflow.com/a/35770938/224132 for an example of this. – Peter Cordes Mar 05 '16 at 23:24

Intel broadwell uop fusion for AVX load/store instructions

1 Answers1