
Below is the loop body of a NASM program ("loop body" meaning I am not showing the parts that instantiate cores and shared memory, read the input data, and write the final results to file). The program is a shared object called from a C wrapper. Line numbers appear as comments on nine of the lines; they correspond to the line numbers referenced in the notes below.

mov rax,255
kmovq k7,rax

label_401:

    cmp r11,r10
    jge label_899

    vmovupd zmm14,[r12+r11]     ; [185]
    add r11,r9                  ; stride [186]

    vmulpd zmm13,zmm14,zmm31    ; [196]

    vmulpd zmm9,zmm14,zmm29     ; [207]
    vmulpd zmm8,zmm13,zmm30

    mov r8,1
Exponent_Label_0:
    vmulpd zmm7,zmm29,zmm29
    add r8,1
    cmp r8,2                    ; rdx
    jl Exponent_Label_0

    vmulpd zmm3,zmm7,zmm8

    vsubpd zmm0,zmm9,zmm3

    vmulpd zmm1,zmm0,zmm28
    vcvttpd2qq zmm0{k7},zmm1    ; [240]
    vcvtuqq2pd zmm2{k7},zmm0    ; [241]
    vsubpd zmm3,zmm1,zmm2
    vmulpd zmm4,zmm3,zmm27      ; [243]
    vcvttpd2qq zmm5{k7}{z},zmm4

    vpcmpgtq k2,zmm5,zmm26
    vpcmpeqq k3{k7},zmm5,zmm26
    kaddq k1,k2,k3

    vcvtqq2pd zmm2{k7},zmm0     ; [252]
    vmulpd zmm1{k7},zmm2,zmm25
    vmovupd zmm2,zmm1
    vaddpd zmm2{k1},zmm1,zmm25

    vmovapd [r15+r14],zmm2      ; [266]
    add r14,r9                  ; stride
    jmp label_401

The program uses AVX-512 register-to-register instructions exclusively between the data read at line 185 and the write of the final results to a shared memory buffer at line 266. I ran it with 1 core and with 4 cores, but the 4-core version is 2-3x slower than the single-core version. I profiled it with Linux perf to understand why.

The perf reports shown below were produced by running all 65 PEBS counters with perf record / perf annotate (to see results by source code line) and perf stat (to get the full counts). Each perf record and perf stat counter was a separate run; the results are aggregated by source code line.

Each instruction is followed by its source code line number. For perf record events, each line shows the percentage of that counter attributable to the source line, with the total count of the event (from perf stat) in parentheses at the end.

My main question is why we see cache hits and misses with multicore on AVX-512 instructions that are entirely register-to-register, but not with the same instructions on a single core. An instruction that operates entirely within registers should generate no cache hits or misses, and each core has its own register file, so I would not expect any cache activity where the instructions are all register-to-register. With a single core we see virtually no cache activity on those instructions.

1.  Line 186 - add r11,r9
mem_inst_retired.all_loads     75.00% (447119383)
mem_inst_retired.all_stores    86.36% (269650353)
mem_inst_retired.split_loads   71.43% (6588771)
mem_load_retired.l1_hit        57.14% (443561879)

Single core (line 177) - add r11,r9
mem_inst_retired.all_stores    24.00% (267231461)

This instruction (add r11,r9) adds two registers. When run with a single core we don't see any cache hits/misses or memory loads, but with multicore we do. Why are there cache hits and memory load events here with multicore but not with a single core?

2.  Line 196 - vmulpd zmm13,zmm14,zmm31
mem_inst_retired.split_loads     28.57% (6588771)
mem_load_retired.fb_hit         100.00% (8327967)
mem_load_retired.l1_hit         14.29% (443561879)
mem_load_retired.l1_miss        66.67% (11033416)

Single core (line 187) - vmulpd zmm13,zmm14,zmm31
mem_load_retired.fb_hit         100.00% (8889146)

This instruction (vmulpd zmm13,zmm14,zmm31) is all registers, but again it shows L1 hits and misses and split loads with multicore but not with a single core.

3. Line 207 - vmulpd zmm9,zmm14,zmm29
mem_load_retired.l1_hit           14.29% (443561879)
mem_load_retired.l1_miss          33.33% (11033416)
rs_events.empty_end               25.00% (37013411)

Single core (line 198):
mem_inst_retired.all_stores       24.00% (267231461)
mem_inst_retired.stlb_miss_stores 22.22%

This instruction (vmulpd zmm9,zmm14,zmm29) is the same instruction as the one described above (vmulpd, all registers), but again it shows L1 hits and misses with multicore but not with a single core. The single core does show second-level TLB (STLB) miss stores and retired store instructions, but no cache hits or misses.

4.  Line 240 - VCVTTPD2QQ zmm0{k7},zmm1
mem_inst_retired.all_loads               23.61% (447119383)
mem_inst_retired.split_loads             26.67% (6588771)
mem_load_l3_hit_retired.xsnp_hitm        28.07% (1089506)
mem_load_l3_hit_retired.xsnp_none        12.90% (1008914)
mem_load_l3_miss_retired.local_dram      40.00% (459610)
mem_load_retired.fb_hit                  29.21% (8327967)
mem_load_retired.l1_miss                 19.82% (11033416)
mem_load_retired.l2_hit                  10.22% (12323435)
mem_load_retired.l2_miss                 24.84% (2606069)
mem_load_retired.l3_hit                  19.70% (700800)
mem_load_retired.l3_miss                 21.05% (553670)

Single core line 231:
mem_load_retired.l1_hit                  25.00% (429499496)
mem_load_retired.l3_hit                  50.00% (306278)

This line (VCVTTPD2QQ zmm0{k7},zmm1) is register-to-register. The single core shows L1 and L3 activity, but the multicore has much more cache activity.

5.  Line 241 - VCVTUQQ2PD zmm2{k7},zmm0
mem_load_l3_hit_retired.xsnp_hitm        21.05% (1089506)
mem_load_l3_miss_retired.local_dram      10.00% (459610)
mem_load_retired.fb_hit                  10.89% (8327967)
mem_load_retired.l2_miss                 13.07% (2606069)
mem_load_retired.l3_miss                 10.53% (553670)

Single core line 232:
mem_load_retired.l1_hit                  12.50% (429499496)

This all-register instruction (VCVTUQQ2PD zmm2{k7},zmm0) shows a lot of cache activity with multicore, but with a single core it shows only a modest share of L1 hits (12.50%). I would not expect any cache hits/misses or load/store events at all on an all-register instruction.

6.  Line 243 - vmulpd zmm4,zmm3,zmm27
br_inst_retired.all_branches_pebs        12.13% (311104072)

Single core line 234:
mem_load_l3_hit_retired.xsnp_none        100.00% (283620)

Why do we see branch instructions for an all-register mul instruction?

7.  Line 252 - VCVTQQ2PD zmm2{k7},zmm0
br_inst_retired.all_branches_pebs     16.62% (311104072)
mem_inst_retired.all_stores           21.22% (269650353)

Single core line 243:
Single core also has branch instructions
br_inst_retired.all_branches_pebs     22.16% (290445009)

For a register-to-register instruction (VCVTQQ2PD zmm2{k7},zmm0), why do we see branch instructions? This instruction does not branch, nor is it preceded or followed by one.

8.  Line 266 - vmovapd [r15+r14],zmm2
br_inst_retired.all_branches_pebs 43.56% (311104072)
mem_inst_retired.all_loads        48.67% (447119383)
mem_inst_retired.all_stores       43.09% (269650353)
mem_inst_retired.split_loads      41.30% (6588771)
mem_inst_retired.stlb_miss_loads  11.36% (487591)
mem_inst_retired.stlb_miss_stores 12.50% (440729)
mem_load_l3_hit_retired.xsnp_hitm 33.33% (1089506)
mem_load_l3_hit_retired.xsnp_none 56.45% (1008914)
mem_load_l3_miss_retired.local_dram 35.00% (459610)
mem_load_retired.fb_hit             39.60% (8327967)
mem_load_retired.l1_hit             48.75% (443561879)
mem_load_retired.l1_miss            51.65% (11033416)
mem_load_retired.l2_hit             71.51% (12323435)
mem_load_retired.l2_miss            45.10% (2606069)
mem_load_retired.l3_hit             59.09% (700800)
mem_load_retired.l3_miss            47.37% (553670)

Single core line 257:
mem_inst_retired.all_loads          84.86% (426023012)
mem_inst_retired.all_stores         59.28% (267231461)
mem_inst_retired.split_loads        89.92% (6477955)
mem_load_l3_miss_retired.local_dram 100.00% (372586)
mem_load_retired.fb_hit              92.80% (8889146)
mem_load_retired.l1_hit              54.17% (429499496)
mem_load_retired.l1_miss             91.30% (4170386)
mem_load_retired.l2_hit             100.00% (4564407)
mem_load_retired.l2_miss            100.00% (476024)
mem_load_retired.l3_hit              33.33% (306278)

This line (vmovapd [r15+r14],zmm2) is the one most likely to account for the difference between single core and multicore. Here we transfer the final results to a memory buffer that is shared by all cores. Because there is memory movement, we expect cache activity with both multicore and single core. The single-core version writes to a buffer created with malloc. The multicore version writes to POSIX shared memory, because that ran significantly faster than an array created with malloc.

Both single core and multicore were run on an Intel Xeon Gold 6140 CPU @ 2.30GHz, which has two FMA units for AVX-512.

To summarize, my questions are: (1) why do we see cache activity on register-to-register instructions with AVX-512 multicore but not single core (except in rare cases); and (2) is there any way to bypass the cache entirely at vmovapd [r15+r14],zmm2 and write straight to memory, to avoid cache misses? POSIX shared memory was an improvement, but it doesn't eliminate them. Finally, are there any other reasons why AVX-512 would be so much slower with multicore than with a single core?

UPDATE: the access pattern for this code is dictated by the AVX-512 register width -- the stride is (64 x number of cores) bytes. With 4 cores, core 0 begins at byte 0, reads and processes 64 bytes, then jumps ahead by 256 bytes (64x4); core 1 begins at byte 64, reads and processes 64 bytes, then jumps by 256; and so on.

RTC222
  • Can you format your code to be readable? At least indent instructions relative to labels, not this total mess. Also, your data is really cluttered and messy to look through because you have 2 lines for each event (twice as much clutter to scan through), and there isn't a consistent column for the numbers so different event names make for ragged numbers. – Peter Cordes Aug 10 '20 at 21:39
  • (2) You can do NT stores, but that's similar to a guaranteed cache miss. Mostly useful for huge buffers that won't be read until after you've written more than (cache-size) amount of data. I.e. the loads would have missed anyway. Without a simple summary of the access pattern for this code, I'm not interested in reading this much of a wall of text and hard-to-read code, so hard to say more than that. I wonder if hyperthreading is relevant, or possible perf counter errata since most of those events should be tied to an instruction, not just OoO exec of other insns. – Peter Cordes Aug 10 '20 at 21:41
  • I changed the formatting of the source code so the line numbers appear as a comment after each line. I gave the perf output as it comes from perf. I will change the perf output so the count (perf stat) appears on the same line as the event. That will take a few minutes. – RTC222 Aug 10 '20 at 22:02
  • Also, I updated the post above to describe the stride pattern. – RTC222 Aug 10 '20 at 22:02
  • 1
    You're interleaving accesses by each thread? No wonder performance is worse!! L2 spatial prefetch tries to complete adjacent pairs of lines. Also cache aliasing, only using 1/4 of the sets. The standard way to parallelize is to have separate threads write contiguous ranges of outputs. Strided also means more TLB misses per load/store. – Peter Cordes Aug 10 '20 at 22:07
  • You missed my point about how to indent asm. You still have all the instructions in the left-most column, not indented relative to labels. See [this codereview](https://codereview.stackexchange.com/questions/204902/checking-if-a-number-is-prime-in-nasm-win64-assembly/204965#204965) for good style. And your perf results are still pretty messy and ragged. More readable would be a table with `name line-number? percent total count` with consistent start columns for each field, so it's easy to compare counts between events. Also, line-number doesn't need repeating for each event. – Peter Cordes Aug 10 '20 at 22:08
  • Re stride it seems intuitively (to me) that if each core accesses 64 bytes, with N-way cache with 64 byte lines I would think that we would have a number of cache lines filled with the first, second, third, etc. 64 bytes so each core would have its 64 bytes. I take your comment to mean that I should assign the first core to start at 0, the second core to start at 10 MB (assuming 40MB input), the third core to start at 20MB, etc. That seems like it would cause much more cache thrashing. – RTC222 Aug 10 '20 at 22:14
  • 1
    Also, why are you merge-masking with an all-ones mask for a bunch of instructions? That's just giving you an output dependency for no reason. And if `jge label_899` is the loop exit condition, put it [at the bottom like a do{}while loop](https://stackoverflow.com/questions/47783926/why-are-loops-always-compiled-into-do-while-style-tail-jump). I'd suggest using intrinsics to let the compiler take care of details like that for you. – Peter Cordes Aug 10 '20 at 22:14
  • 1
    Remember that each core has its own *private* L1d and L2 cache. (Shared only by two hyperthreads on a physical core). If a core only ever uses every 4th cache line, that's only 1/4 of the sets with a simple indexing function like L1d and L2 use (i.e. taking low address bits). Having each core start at a 10MB offset doesn't cause thrashing in L3 because L3 uses a different indexing function (more like a hash of more bits). – Peter Cordes Aug 10 '20 at 22:17
  • Thanks for illuminating why a large separation between cores works best. I was told by someone at Intel recently that "AVX512 machines, from Sky Lake server on, have non-inclusive LLC => if something is in LLC then it is not in L1 or L2 (and the other way round)." That could also have some bearing on the issue with AVX-512. – RTC222 Aug 10 '20 at 22:23
  • Re the NASM code above -- I always write my NASM flush left. I don't indent anything. I'll indent above to make it more readable for others. – RTC222 Aug 10 '20 at 22:24
  • 1
    You misunderstood (or were given wrong info) about Skylake-server caches. What you described is *exclusive* cache policy but Skylake-X isn't like that. It's not-inclusive no-exclusive ([NINE](https://en.wikipedia.org/wiki/Cache_inclusion_policy#NINE_Policy)), so data *can* be in L3 at the same time as L1, but dropping it from L3 can happen without forcing it to be evicted from L1. If a load misses all the way to DRAM, it will populate all 3 levels of cache along the way, just like with previous uarches with inclusive L3. – Peter Cordes Aug 10 '20 at 22:35
  • I'm still curious as to why I have any cache access at all on all-register AVX-512 instructions. I don't see where memory would be involved at all. – RTC222 Aug 10 '20 at 22:39
  • If it looks a little strange, I'm still reformatting it. – RTC222 Aug 10 '20 at 22:39
  • IDK, I'd guess maybe hyperthreading; the other logical core retired a load instruction in the same cycle as an `add` or something. If you can disable HT, or pin threads to separate physical cores, getting different results would be some evidence for that hypothesis. – Peter Cordes Aug 10 '20 at 22:46
  • Here's the plan based on your comments: (1) change stride to what I call "even data division" instead of what I call "leapfrog" (2) no all-1s k7 merge-masking; change the loop exit; (3) disable hyperthreading. The threads are already pinned to separate cores. I'll post back with results as soon as I have some. – RTC222 Aug 10 '20 at 22:50
  • My point with pinning was that you can start half as many threads, and pin them to only the even-numbered logical cores, or only the first half of the logical cores, depending which way the kernel maps core numbers. So you can make sure you don't have threads on sibling logical cores even without rebooting to disable HT. – Peter Cordes Aug 10 '20 at 22:55
  • 2
    I'm struggling to understand this question with all these events and numbers and single-core/multi-core and counting/sampling. Make the question focused as much as possible. Do you need to show both the event counts and samples? Do you need to mention all these events? Give the exact perf commands and the exact perf output. Don't mix everything together like that. Consider splitting your question into multiple ones. – Hadi Brais Aug 11 '20 at 02:43
  • @Peter Cordes - I changed the stride to even data division as described in my comment above and that nearly doubled the performance; changed the loop exit and that was a small improvement; eliminated k7 merge-masking made no difference in performance. – RTC222 Aug 11 '20 at 18:20
  • 1
    Later this week I will post a new (much shorter) question after profiling. My remaining question is the question above -- why do I see cache activity on all-register AVX-512 instructions with multicore but not single core. The threads are all pinned to separate physical cores (I checked), so hyperthreading is not involved. The single core is still about 35% faster than multicore. – RTC222 Aug 11 '20 at 18:21

0 Answers