I'm trying to establish a performance baseline for memory-bound vectorized loops. I'm doing this on an Intel Broadwell chip, using AVX2 instructions on 32-byte-aligned data.
The baseline loop uses 8 YMM registers per iteration to load from one array and store non-temporally to another:
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignment
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovapd ymm0, ymmword ptr [ (32) + rdi +8*rax]
vmovapd ymm2, ymmword ptr [ (64) + rdi +8*rax]
vmovapd ymm4, ymmword ptr [ (96) + rdi +8*rax]
vmovapd ymm6, ymmword ptr [ (128) + rdi +8*rax]
vmovapd ymm8, ymmword ptr [ (160) + rdi +8*rax]
vmovapd ymm10, ymmword ptr [ (192) + rdi +8*rax]
vmovapd ymm12, ymmword ptr [ (224) + rdi +8*rax]
vmovapd ymm14, ymmword ptr [ (256) + rdi +8*rax]
vmovntpd ymmword ptr [ (32) + rsi +8*rax], ymm0
vmovntpd ymmword ptr [ (64) + rsi +8*rax], ymm2
vmovntpd ymmword ptr [ (96) + rsi +8*rax], ymm4
vmovntpd ymmword ptr [ (128) + rsi +8*rax], ymm6
vmovntpd ymmword ptr [ (160) + rsi +8*rax], ymm8
vmovntpd ymmword ptr [ (192) + rsi +8*rax], ymm10
vmovntpd ymmword ptr [ (224) + rsi +8*rax], ymm12
vmovntpd ymmword ptr [ (256) + rsi +8*rax], ymm14
add rax, (4*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
pop rbp ; balance the push rbp in the prologue before returning
ret
I assemble it with YASM and then analyze it with the Intel Architecture Code Analyzer (IACA), which reports:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 8.0 4.0 | 8.0 4.0 | 8.0 | 0.5 | 0.5 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm0, ymmword ptr [rdi+rax*8+0x20]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm2, ymmword ptr [rdi+rax*8+0x40]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm4, ymmword ptr [rdi+rax*8+0x60]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm6, ymmword ptr [rdi+rax*8+0x80]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm8, ymmword ptr [rdi+rax*8+0xa0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm10, ymmword ptr [rdi+rax*8+0xc0]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm12, ymmword ptr [rdi+rax*8+0xe0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm14, ymmword ptr [rdi+rax*8+0x100]
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x20], ymm0
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x40], ymm2
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x60], ymm4
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x80], ymm6
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xa0], ymm8
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xc0], ymm10
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xe0], ymm12
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x100], ymm14
| 1 | | 0.5 | | | | 0.5 | | | | add rax, 0x20
| 1 | 0.5 | | | | | | 0.5 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff78
I was under the impression that Broadwell can execute two loads per cycle, one on each of ports 2 and 3. Why isn't that happening here?
Thanks
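To spell out what the 8-cycle figure would mean for this loop, here is a rough arithmetic sketch in Python (not part of the benchmark itself). The constants are copied from the listing and the report above; the variable names are mine, and the result is only IACA's static estimate, not a measurement.

```python
# Constants copied from the listing; the cycle count is IACA's static estimate.
ELEM_BYTES = 8               # rax is scaled by 8 in [rdi + 8*rax]
SIZE = 16777216 * 8          # loop bound from the %define, in 8-byte elements
STEP = 4 * 8                 # elements added to rax each iteration
YMM_BYTES = 32               # one ymm register holds 32 bytes
CYCLES_PER_ITER = 8.0        # "Block Throughput" from the report

iterations = SIZE // STEP
bytes_per_iter = STEP * ELEM_BYTES            # bytes loaded (and stored) per iteration
ymm_per_iter = bytes_per_iter // YMM_BYTES    # should match the 8 loads in the loop body
array_bytes = SIZE * ELEM_BYTES               # footprint of each array
load_bytes_per_cycle = bytes_per_iter / CYCLES_PER_ITER

print(iterations)            # 4194304
print(ymm_per_iter)          # 8
print(array_bytes // 2**20)  # 1024 (MiB per array)
print(load_bytes_per_cycle)  # 32.0
```

So at IACA's estimate the loop would complete one 32-byte load and one 32-byte store per cycle. As far as I know IACA assumes every access hits in cache, so this is a core-side ceiling only and says nothing about what the memory system can sustain.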
UPDATE
Per the suggestion below, I replaced the pd instructions with their ps equivalents and consolidated each address into a single register. The new code looks like this:
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignment
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
xor rbx,rbx
xor rcx,rcx
or rbx, rdi
or rcx, rsi
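mov ebx, 111 ; IACA PREFIX (note: this overwrites the input pointer just placed in rbx, so this listing is for static analysis only)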
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovaps ymm0, ymmword ptr [ (32) + rbx ]
vmovaps ymm2, ymmword ptr [ (64) + rbx ]
vmovaps ymm4, ymmword ptr [ (96) + rbx ]
vmovaps ymm6, ymmword ptr [ (128) + rbx ]
vmovaps ymm8, ymmword ptr [ (160) + rbx ]
vmovaps ymm10, ymmword ptr [ (192) + rbx ]
vmovaps ymm12, ymmword ptr [ (224) + rbx ]
vmovaps ymm14, ymmword ptr [ (256) + rbx ]
vmovntps ymmword ptr [ (32) + rcx], ymm0
vmovntps ymmword ptr [ (64) + rcx], ymm2
vmovntps ymmword ptr [ (96) + rcx], ymm4
vmovntps ymmword ptr [ (128) + rcx], ymm6
vmovntps ymmword ptr [ (160) + rcx], ymm8
vmovntps ymmword ptr [ (192) + rcx], ymm10
vmovntps ymmword ptr [ (224) + rcx], ymm12
vmovntps ymmword ptr [ (256) + rcx], ymm14
add rax, (4*8)
add rbx, (4*8*8)
add rcx, (4*8*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
pop rbp ; balance the push rbp in the prologue before returning
ret
Then IACA tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 5.3 4.0 | 5.3 4.0 | 8.0 | 1.0 | 1.0 | 5.3 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm0, ymmword ptr [rbx+0x20]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm2, ymmword ptr [rbx+0x40]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm4, ymmword ptr [rbx+0x60]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm6, ymmword ptr [rbx+0x80]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm8, ymmword ptr [rbx+0xa0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm10, ymmword ptr [rbx+0xc0]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm12, ymmword ptr [rbx+0xe0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm14, ymmword ptr [rbx+0x100]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x20], ymm0
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x40], ymm2
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x60], ymm4
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x80], ymm6
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xa0], ymm8
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xc0], ymm10
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xe0], ymm12
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0x100], ymm14
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | 1.0 | | | | | | | | add rbx, 0x100
| 1 | | | | | | 1.0 | | | | add rcx, 0x100
| 1 | | | | | | | 1.0 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff7a
Which tells me that the stores can now use port 7 for their addresses and that the store micro-ops are now fused (the ^ marker). IACA tells me that the "Block Throughput" is still 8 cycles, presumably due to the extra operations needed to get the addresses into a single register. Maybe I'm doing this wrong?
I still don't understand why the load operations cannot be fused.
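To check my reading of the two reports, here is a rough sketch in Python that recomputes a per-port lower bound from the uop counts in the tables. The model is my own simplification, not how IACA or the real scheduler works: each group of uops may run on any of the ports IACA lists for it, and any set of ports needs at least (uops confined to that set) / (number of ports in the set) cycles. The names `confined_cycles`, `port_bound`, `LOOP1` and `LOOP2` are made up for this illustration.

```python
from itertools import combinations

def confined_cycles(uop_classes, ports):
    """Cycles needed by the uops that can run only on the given ports."""
    ports = set(ports)
    confined = sum(count for eligible, count in uop_classes if eligible <= ports)
    return confined / len(ports)

def port_bound(uop_classes):
    """Largest confined_cycles value over every non-empty subset of ports."""
    all_ports = sorted(set().union(*(eligible for eligible, _ in uop_classes)))
    return max(
        confined_cycles(uop_classes, subset)
        for k in range(1, len(all_ports) + 1)
        for subset in combinations(all_ports, k)
    )

ALU = frozenset({0, 1, 5, 6})  # ports the add/cmp uops are spread over in the tables

# (eligible ports, uops per iteration), read off the two IACA tables.
LOOP1 = [
    (frozenset({2, 3}), 8),     # loads
    (frozenset({2, 3}), 8),     # store-address uops (indexed addressing)
    (frozenset({4}), 8),        # store-data uops
    (ALU, 2),                   # add, cmp (jnz is macro-fused)
]
LOOP2 = [
    (frozenset({2, 3}), 8),     # loads
    (frozenset({2, 3, 7}), 8),  # store-address uops (base + displacement)
    (frozenset({4}), 8),        # store-data uops
    (ALU, 4),                   # three adds, cmp (jnz is macro-fused)
]

print(port_bound(LOOP1))                            # 8.0
print(port_bound(LOOP2))                            # 8.0
print(round(confined_cycles(LOOP2, {2, 3, 7}), 1))  # 5.3
```

Under this model the second version relieves ports 2 and 3 (5.3 cycles instead of 8), but the eight store-data uops on port 4 still give a bound of 8 cycles. That matches the "Throughput Bottleneck: Port4" line in the second report, and the four ALU uops contribute only 1 cycle each to their ports.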