Below is the loop body of a NASM program. I am not showing the parts that instantiate cores and shared memory, read the input data, or write the final results to file. The program is a shared object called from a C wrapper. Nine of the lines carry a line number in brackets; these are the line numbers referenced in the notes below.
mov rax,255
kmovq k7,rax
label_401:
cmp r11,r10
jge label_899
vmovupd zmm14,[r12+r11] ;[185]
add r11,r9 ; stride ;[186]
vmulpd zmm13,zmm14,zmm31 ; [196]
vmulpd zmm9,zmm14,zmm29 ; [207]
vmulpd zmm8,zmm13,zmm30
mov r8,1
Exponent_Label_0:
vmulpd zmm7,zmm29,zmm29
add r8,1
cmp r8,2 ;rdx
jl Exponent_Label_0
vmulpd zmm3,zmm7,zmm8
vsubpd zmm0,zmm9,zmm3
vmulpd zmm1,zmm0,zmm28
VCVTTPD2QQ zmm0{k7},zmm1 ; [240]
VCVTUQQ2PD zmm2{k7},zmm0 ; [241]
vsubpd zmm3,zmm1,zmm2
vmulpd zmm4,zmm3,zmm27 ; [243]
VCVTTPD2QQ zmm5{k7}{z},zmm4
VPCMPGTQ k2,zmm5,zmm26
VPCMPEQQ k3 {k7},zmm5,zmm26
KADDQ k1,k2,k3
VCVTQQ2PD zmm2{k7},zmm0 ; [252]
vmulpd zmm1{k7},zmm2,zmm25
vmovupd zmm2,zmm1
VADDPD zmm2{k1},zmm1,zmm25
vmovapd [r15+r14],zmm2 ; [266]
add r14,r9 ; stride
jmp label_401
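For readers who prefer to follow the arithmetic without the register names, here is a scalar C sketch of what one 64-bit lane of the loop body appears to compute. It is only a reading of the listing above, not the original source: the struct and function names are hypothetical, the contents of zmm25 through zmm31 are not shown in the question and are treated as plain parameters, and it assumes every intermediate value is non-negative and fits in a signed 64-bit integer (so the unsigned and signed conversions agree).

```c
#include <stdint.h>

/* Hypothetical container: the question does not say what zmm25..zmm31 hold. */
struct loop_consts {
    double  c31, c30, c29, c28, c27, c25; /* values broadcast in zmm31..zmm25 */
    int64_t c26;                          /* integer threshold held in zmm26 */
};

/* One lane of the loop body, under the assumptions stated above. */
static double loop_body_lane(double x, const struct loop_consts *k)
{
    double t13 = x * k->c31;               /* vmulpd zmm13,zmm14,zmm31  [196] */
    double t9  = x * k->c29;               /* vmulpd zmm9,zmm14,zmm29   [207] */
    double t8  = t13 * k->c30;             /* vmulpd zmm8,zmm13,zmm30 */
    double t7  = k->c29 * k->c29;          /* Exponent_Label_0 executes once */
    double t1  = (t9 - t7 * t8) * k->c28;  /* vmulpd, vsubpd, vmulpd */

    int64_t q    = (int64_t)t1;            /* VCVTTPD2QQ truncates      [240] */
    double  frac = t1 - (double)q;         /* VCVTUQQ2PD, vsubpd        [241] */
    int64_t d    = (int64_t)(frac * k->c27); /* vmulpd, VCVTTPD2QQ      [243] */

    double r = (double)q * k->c25;         /* VCVTQQ2PD, vmulpd         [252] */
    if (d >= k->c26)                       /* k1 = (d > c26) + (d == c26) */
        r += k->c25;                       /* VADDPD zmm2{k1},zmm1,zmm25 */
    return r;                              /* stored at                 [266] */
}
```

The point of the sketch is that nothing between the load at line 185 and the store at line 266 names a memory operand; every step is arithmetic on values already held in registers.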
The program uses AVX-512 register-to-register instructions exclusively between the data read at line 185 and the store at line 266, where the final results are written to a shared memory buffer. I ran it with 1 core and with 4 cores, and the 4-core version is 2-3 times slower than the single-core version. I profiled both with Linux perf to understand why.
The perf reports below were produced by running all 65 PEBS counters with perf record and perf annotate (to see results by source code line) and with perf stat (to get the full count). Each perf record and perf stat counter was a separate run, and the results are aggregated by source code line.
Each numbered item below names a source code line and its instruction. Each counter line under it shows the percentage of that counter attributed to the source line (from perf record), followed in parentheses by the total count for that counter (from perf stat).
My main question is why we see cache hits and misses with multicore on AVX-512 instructions that are entirely register-to-register, but not on the same instructions with a single core. Each core has its own set of registers, so I would not expect any cache activity on an instruction that only touches registers. With a single core we see virtually no cache activity on these instructions.
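To make the two numbers on each counter line concrete: multiplying the percentage by the perf stat total gives a rough estimate of how many events fall on that source line. The C sketch below does this for the mem_inst_retired.all_loads figure reported for line 186 (75.00% of 447119383). The function name is hypothetical, and the estimate assumes the separate perf record and perf stat runs behaved alike, which is not guaranteed.

```c
/* Estimate the number of events attributed to one source line:
 * the perf record share (in percent) times the perf stat total. */
static double attributed_events(double percent, unsigned long long total)
{
    return percent / 100.0 * (double)total;
}
```

For example, attributed_events(75.00, 447119383ULL) comes to roughly 335 million load events attributed to that line.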
1. Line 186 - add r11,r9
mem_inst_retired.all_loads 75.00% (447119383)
mem_inst_retired.all_stores 86.36% (269650353)
mem_inst_retired.split_loads 71.43% (6588771)
mem_load_retired.l1_hit 57.14% (443561879)
Single core (line 177) - add r11,r9
mem_inst_retired.all_stores 24.00% (267231461)
This instruction (add r11,r9) adds two registers. With a single core we don't see any cache hits/misses or memory loads on it, but with multicore we do. Why are there cache hits and memory load instructions here with multicore but not with a single core?
2. Line 196 - vmulpd zmm13,zmm14,zmm31
mem_inst_retired.split_loads 28.57% (6588771)
mem_load_retired.fb_hit 100.00% (8327967)
mem_load_retired.l1_hit 14.29% (443561879)
mem_load_retired.l1_miss 66.67% (11033416)
Single core (line 187) - vmulpd zmm13,zmm14,zmm31
mem_load_retired.fb_hit 100.00% (8889146)
This instruction (vmulpd zmm13,zmm14,zmm31) is all registers, but again it shows L1 hits and misses and split loads with multicore but not with a single core.
3. Line 207 - vmulpd zmm9,zmm14,zmm29
mem_load_retired.l1_hit 14.29% (443561879)
mem_load_retired.l1_miss 33.33% (11033416)
rs_events.empty_end 25.00% (37013411)
Single core (line 198):
mem_inst_retired.all_stores 24.00% (267231461)
mem_inst_retired.stlb_miss_stores 22.22%
This instruction (vmulpd zmm9,zmm14,zmm29) is the same kind of instruction as the one above it (vmulpd, all registers), and again it shows L1 hits and misses with multicore but not with a single core. The single core does show second-level TLB store misses and store instructions retired, but no cache activity.
4. Line 240 - VCVTTPD2QQ zmm0{k7},zmm1
mem_inst_retired.all_loads 23.61% (447119383)
mem_inst_retired.split_loads 26.67% (6588771)
mem_load_l3_hit_retired.xsnp_hitm 28.07% (1089506)
mem_load_l3_hit_retired.xsnp_none 12.90% (1008914)
mem_load_l3_miss_retired.local_dram 40.00% (459610)
mem_load_retired.fb_hit 29.21% (8327967)
mem_load_retired.l1_miss 19.82% (11033416)
mem_load_retired.l2_hit 10.22% (12323435)
mem_load_retired.l2_miss 24.84% (2606069)
mem_load_retired.l3_hit 19.70% (700800)
mem_load_retired.l3_miss 21.05% (553670)
Single core line 231:
mem_load_retired.l1_hit 25.00% (429499496)
mem_load_retired.l3_hit 50.00% (306278)
This line (VCVTTPD2QQ zmm0{k7},zmm1) is register-to-register. The single core shows L1 and L3 activity, but the multicore has much more cache activity.
5. Line 241 - VCVTUQQ2PD zmm2{k7},zmm0
mem_load_l3_hit_retired.xsnp_hitm 21.05% (1089506)
mem_load_l3_miss_retired.local_dram 10.00% (459610)
mem_load_retired.fb_hit 10.89% (8327967)
mem_load_retired.l2_miss 13.07% (2606069)
mem_load_retired.l3_miss 10.53%
Single core line 232:
Single core has no cache misses reported, only L1 hits
mem_load_retired.l1_hit 12.50% (429499496)
This is an all-register instruction (VCVTUQQ2PD zmm2{k7},zmm0), yet it shows a lot of cache activity with multicore and only a small share of the L1 hits with single core (12.5%). I would not expect to see any cache hits/misses or load/store instructions on an all-register instruction.
6. Line 243 - vmulpd zmm4,zmm3,zmm27
br_inst_retired.all_branches_pebs 12.13% (311104072)
Single core line 234:
mem_load_l3_hit_retired.xsnp_none 100.00% (283620)
Why are branch instructions attributed to an all-register multiply instruction?
7. Line 252 - VCVTQQ2PD zmm2{k7},zmm0
br_inst_retired.all_branches_pebs 16.62% (311104072)
mem_inst_retired.all_stores 21.22% (269650353)
Single core line 243:
Single core also has branch instructions
br_inst_retired.all_branches_pebs 22.16% (290445009)
For a register-to-register instruction (VCVTQQ2PD zmm2{k7},zmm0), why do we see branch instructions? This instruction does not branch, nor is it preceded or followed by a branch.
8. Line 266 - vmovapd [r15+r14],zmm2
br_inst_retired.all_branches_pebs 43.56% (311104072)
mem_inst_retired.all_loads 48.67% (447119383)
mem_inst_retired.all_stores 43.09% (269650353)
mem_inst_retired.split_loads 41.30% (6588771)
mem_inst_retired.stlb_miss_loads 11.36% (487591)
mem_inst_retired.stlb_miss_stores 12.50% (440729)
mem_load_l3_hit_retired.xsnp_hitm 33.33% (1089506)
mem_load_l3_hit_retired.xsnp_none 56.45% (1008914)
mem_load_l3_miss_retired.local_dram 35.00% (459610)
mem_load_retired.fb_hit 39.60% (8327967)
mem_load_retired.l1_hit 48.75% (443561879)
mem_load_retired.l1_miss 51.65% (11033416)
mem_load_retired.l2_hit 71.51% (12323435)
mem_load_retired.l2_miss 45.10% (2606069)
mem_load_retired.l3_hit 59.09% (700800)
mem_load_retired.l3_miss 47.37% (553670)
Single core line 257:
mem_inst_retired.all_loads 84.86% (426023012)
mem_inst_retired.all_stores 59.28% (267231461)
mem_inst_retired.split_loads 89.92% (6477955)
mem_load_l3_miss_retired.local_dram 100.00% (372586)
mem_load_retired.fb_hit 92.80% (8889146)
mem_load_retired.l1_hit 54.17% (429499496)
mem_load_retired.l1_miss 91.30% (4170386)
mem_load_retired.l2_hit 100.00% (4564407)
mem_load_retired.l2_miss 100.00% (476024)
mem_load_retired.l3_hit 33.33% (306278)
This line (vmovapd [r15+r14],zmm2) is probably the one that matters most for the difference between single core and multicore. Here we store the final results to a memory buffer that is shared by all cores. Because this is a memory access, we expect to see cache activity with both multicore and single core. The single core uses a single buffer created with malloc. The multicore version uses POSIX shared memory because that ran significantly faster than an array created with malloc.
Both single core and multicore were run on an Intel Xeon Gold 6140 CPU @ 2.30GHz, which has two FMA units for AVX-512.
To summarize, my questions are: (1) why do we see cache activity on register-to-register instructions with AVX-512 multicore but not with a single core (except in rare cases); and (2) is there any way to bypass the cache entirely at vmovapd [r15+r14],zmm2 and go straight to memory to avoid cache misses? POSIX shared memory was an improvement, but it does not solve the problem completely. Finally, are there any other reasons why AVX-512 would be so much slower with multicore than with a single core?
UPDATE: the access pattern for this code is dictated by the AVX-512 register width: the stride is (64 x number of cores) bytes. With 4 cores, core 0 begins at 0, reads and processes 64 bytes, then jumps by 256 (64x4); core 1 begins at 64, reads and processes 64 bytes, then jumps by 256, etc.
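The interleaving described in this update can be written down as a small C sketch. The names are hypothetical and not taken from the actual program; the sketch only reproduces the offsets stated above, assuming each core handles one 64-byte block per iteration.

```c
#include <stddef.h>

enum { BLOCK_BYTES = 64 }; /* one zmm register; also the cache line size on this CPU */

/* Byte offset of the i-th block processed by `core` when `ncores` cores
 * interleave: start at 64*core, then advance by 64*ncores each iteration. */
static size_t block_offset(size_t core, size_t ncores, size_t i)
{
    return core * BLOCK_BYTES + i * ncores * BLOCK_BYTES;
}
```

With 4 cores this gives offsets 0, 256, 512, ... for core 0 and 64, 320, 576, ... for core 1. If the buffer base is 64-byte aligned, every core therefore touches every fourth cache line, and each 256-byte span of the buffer is divided among all four cores.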