Whiskey Lake i7-8565U
The RESOURCE_STALLS.OTHER event
does not appear to be well explained by the Intel docs:
Counts the number of cycles while execution was stalled due to other resource issues.
I ran the experiments on a memory copy of 16 MiB of
randomly generated data, repeated in a loop of 6400
iterations.
Baseline:
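avx_memcpy_baseline:
    shr rdx, 0x3                   ; rdx = size / 8, assuming rdx holds the byte count
    xor rcx, rcx                   ; rcx = offset in 8-byte units
avx_memcpy_baseline_loop:
    add rcx, 0x08                  ; advance 8 units = 64 bytes, no loads or stores
    cmp rdx, rcx
    ja avx_memcpy_baseline_loop    ; loop while rdx > rcx (unsigned)
    ret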
Baseline counters:
823 292 269 resource_stalls.any
181 045 r02a2 #LOAD
831 370 403 r04a2 #RS_FULL
49 659 resource_stalls.sb
130 100 r10a2 #ROB_FULL
63 386 r20a2 #FPCW
2 151 516 r40a2 #MXCSR
4 222 r80a2 #OTHER
WB Stores:
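avx_memcpy_forward_llss:
    shr rdx, 0x3                          ; rdx = size / 8, assuming rdx holds the byte count
    xor rcx, rcx                          ; rcx = offset in 8-byte units
avx_memcpy_forward_loop_llss:
    vmovdqa ymm0, [rsi + 8*rcx]           ; aligned 32-byte load
    vmovdqa ymm1, [rsi + 8*rcx + 0x20]    ; aligned load of the next 32 bytes
    vmovdqa [rdi + rcx*8], ymm0           ; regular (write-back) 32-byte store
    vmovdqa [rdi + rcx*8 + 0x20], ymm1
    add rcx, 0x08                         ; advance 8 units = 64 bytes
    cmp rdx, rcx
    ja avx_memcpy_forward_loop_llss       ; loop while rdx > rcx (unsigned)
    ret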
WB Stores counters:
27 089 245 473 resource_stalls.any
4 873 836 r02a2 #LOAD
14 099 696 r04a2 #RS_FULL
24 130 341 296 resource_stalls.sb
5 790 969 r10a2 #ROB_FULL
375 032 r20a2 #FPCW
3 395 592 r40a2 #MXCSR
4 899 892 032 r80a2 #resource_stalls.other 18% of RESOURCE_STALLS.ANY
NT Stores:
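avx_nt_memcpy_forward_llss:
    shr rdx, 0x3                            ; rdx = size / 8, assuming rdx holds the byte count
    xor rcx, rcx                            ; rcx = offset in 8-byte units
avx_nt_memcpy_forward_loop_llss:
    vmovdqa ymm0, [rsi + 8*rcx]             ; aligned 32-byte load
    vmovdqa ymm1, [rsi + 8*rcx + 0x20]      ; aligned load of the next 32 bytes
    vmovntdq [rdi + rcx*8], ymm0            ; non-temporal 32-byte store
    vmovntdq [rdi + rcx*8 + 0x20], ymm1
    add rcx, 0x08                           ; advance 8 units = 64 bytes
    cmp rdx, rcx
    ja avx_nt_memcpy_forward_loop_llss      ; loop while rdx > rcx (unsigned)
    ret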
NT Stores counters:
18 121 917 993 resource_stalls.any
2 211 195 r02a2 #LOAD
5 588 784 r04a2 #RS_FULL
12 061 475 989 resource_stalls.sb
3 156 129 r10a2 #ROB_FULL
165 967 r20a2 #FPCW
2 152 595 r40a2 #MXCSR
6 730 668 837 r80a2 #resource_stalls.other 37% of RESOURCE_STALLS.ANY
It is particularly noticeable with non-temporal stores, where it accounts for roughly a third of all resource stalls, so I'm curious what RESOURCE_STALLS.OTHER
might mean when profiling memory-bound routines on Skylake or later.