
I have two applications (both single-threaded). One is a word-reindexing application (named wrmem), which can be found here; the other is a simple loop that iterates over an array (described in more detail below).

I noticed that when I pin both apps to the same logical core (leaving its sibling hyper-thread idle), wrmem runs faster. However, when I run them on two sibling hyper-threads, wrmem takes longer to execute.

My question is: why does wrmem run faster when both applications are assigned to a single logical core? It would be great if someone could explain whether there is any way to determine if two applications should be pinned to a single hyper-thread or to sibling hyper-threads (perhaps through performance-counter analysis?).
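
For reference, which logical CPUs are hyper-thread siblings can be read from sysfs; a minimal sketch (the sysfs path is standard on Linux, but the 64-CPU upper bound is an arbitrary assumption):

```c
/* Sketch: print which logical CPUs are hyper-thread siblings on Linux
 * by reading the standard sysfs topology files. */
#include <stdio.h>

int main(void) {
    for (int cpu = 0; cpu < 64; cpu++) {  /* 64 is an arbitrary upper bound */
        char path[128], buf[64];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;  /* no such CPU: done */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d siblings: %s", cpu, buf);  /* e.g. "0,4" */
        fclose(f);
    }
    return 0;
}
```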

I have attached the perf output for the events I thought might be useful for the analysis.

wrmem on a sibling hyper-thread (the other app on the other sibling):

       4036814116      dTLB-loads                                                    (23.52%)
          74228013      dTLB-load-misses          #    1.84% of all dTLB cache hits   (23.56%)
       31356167246      cycles                                                        (23.61%)
        2356371896      dtlb_load_misses.walk_active                                     (23.65%)
          18110002      cache-misses              #    6.865 % of all cache refs      (23.62%)
         263804080      cache-references                                              (23.57%)
          70217803      branch-misses             #    2.62% of all branches          (23.53%)
        2679691364      branches                                                      (23.49%)
         211970964      bus-cycles                                                    (23.49%)
                13      context-switches                                            
            175707      page-faults                                                 
        3996104598      L1-dcache-loads                                               (23.49%)
         527053354      L1-dcache-load-misses     #   13.19% of all L1-dcache hits    (23.50%)
        2437399504      dTLB-stores                                                   (23.50%)
          34857064      dTLB-store-misses                                             (23.50%)
                 0      mem-loads                                                     (23.50%)
          77522441      dtlb_load_misses.miss_causes_a_walk                                     (23.49%)
          37317020      dtlb_store_misses.miss_causes_a_walk                                     (23.49%)
        3481625263      dtlb_store_misses.walk_active                                     (23.49%)

       8.534166029 seconds time elapsed

wrmem sharing a single logical core with the other app:

        4021938226      dTLB-loads                                                    (23.45%)
           1339043      dTLB-load-misses          #    0.03% of all dTLB cache hits   (23.58%)
       14092062606      cycles                                                        (23.69%)
          87412240      dtlb_load_misses.walk_active                                     (23.79%)
          15980810      cache-misses              #   32.547 % of all cache refs      (23.84%)
          49100039      cache-references                                              (23.82%)
          77863788      branch-misses             #    2.86% of all branches          (23.76%)
        2725709999      branches                                                      (23.66%)
          95364600      bus-cycles                                                    (23.55%)
               246      context-switches                                            
            175706      page-faults                                                 
        3989720332      L1-dcache-loads                                               (23.44%)
         453219493      L1-dcache-load-misses     #   11.36% of all L1-dcache hits    (23.30%)
        2459754128      dTLB-stores                                                   (23.30%)
          28088729      dTLB-store-misses                                             (23.30%)
                 0      mem-loads                                                     (23.30%)
           1996539      dtlb_load_misses.miss_causes_a_walk                                     (23.40%)
          37192560      dtlb_store_misses.miss_causes_a_walk                                     (23.40%)
        1694922205      dtlb_store_misses.walk_active                                     (23.40%)

       7.684306529 seconds time elapsed

The other app is just a loop over an array with a 4096-byte stride.
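
A hypothetical reconstruction of that loop (the array size and pass count are guesses, not from the original source, though 1 GiB would at least be consistent with the ~262K page-faults reported below):

```c
/* Sketch of the busy-loop app: touch one byte per 4 KiB page, so nearly
 * every load needs a new dTLB entry, matching the ~85-89% dTLB-load-miss
 * rates in the perf output below. */
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE (1UL << 30)  /* 1 GiB -- a guess, not from the original */
#define STEP 4096UL             /* one access per page */

int main(void) {
    volatile char *arr = calloc(ARRAY_SIZE, 1);
    if (!arr)
        return 1;
    unsigned long sum = 0;
    for (int pass = 0; pass < 100; pass++)  /* pass count is also a guess */
        for (size_t i = 0; i < ARRAY_SIZE; i += STEP)
            sum += arr[i];  /* volatile read: the loads can't be optimized away */
    printf("%lu\n", sum);
    return 0;
}
```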

The other app when running on the same logical core as wrmem:

        1509514481      dTLB-loads                                                    (23.47%)
        1345520064      dTLB-load-misses          #   89.14% of all dTLB cache hits   (23.50%)
       52986473567      cycles                                                        (23.52%)
       51187627462      dtlb_load_misses.walk_active                                     (23.55%)
          24803686      cache-misses              #    0.771 % of all cache refs      (23.56%)
        3218128188      cache-references                                              (23.56%)
            624235      branch-misses             #    0.04% of all branches          (23.57%)
        1483035278      branches                                                      (23.57%)
         358401846      bus-cycles                                                    (23.58%)
               251      context-switches                                            
            262239      page-faults                                                 
        1630995048      L1-dcache-loads                                               (23.58%)
        2707508863      L1-dcache-load-misses     #  166.00% of all L1-dcache hits    (23.57%)
         194944978      dTLB-stores                                                   (23.57%)
           1773489      dTLB-store-misses                                             (23.53%)
                 0      mem-loads                                                     (23.50%)
        1344962228      dtlb_load_misses.miss_causes_a_walk                                     (23.47%)
           2721695      dtlb_store_misses.miss_causes_a_walk                                     (23.45%)
          78331814      dtlb_store_misses.walk_active                                     (23.46%)

      18.162205841 seconds time elapsed

The other app when running on a sibling hyper-thread:

        1570720041      dTLB-loads                                                    (23.50%)
        1342959305      dTLB-load-misses          #   85.50% of all dTLB cache hits   (23.50%)
       59079247016      cycles                                                        (23.50%)
       56895513621      dtlb_load_misses.walk_active                                     (23.53%)
          37209980      cache-misses              #    1.115 % of all cache refs      (23.54%)
        3336817534      cache-references                                              (23.54%)
            626337      branch-misses             #    0.04% of all branches          (23.54%)
        1457502744      branches                                                      (23.54%)
         399413773      bus-cycles                                                    (23.54%)
                10      context-switches                                            
            262239      page-faults                                                 
        1523989098      L1-dcache-loads                                               (23.54%)
        2714388590      L1-dcache-load-misses     #  178.11% of all L1-dcache hits    (23.54%)
         150322599      dTLB-stores                                                   (23.54%)
           1832015      dTLB-store-misses                                             (23.54%)
                 0      mem-loads                                                     (23.54%)
        1341108173      dtlb_load_misses.miss_causes_a_walk                                     (23.54%)
           2718493      dtlb_store_misses.miss_causes_a_walk                                     (23.53%)
          78263090      dtlb_store_misses.walk_active                                     (23.51%)

      16.042126229 seconds time elapsed
  • *a loop over an array with a 4096-byte stride* - so basically worst case for causing memory traffic and TLB misses, although at least it's only evicting lines from 1 set of L1d cache, not polluting the whole cache. Presumably your real work is impacted by more than a factor of 2 by the busy loop, so getting the whole physical core to itself half the time is better. Is the other task infinite? Did the busy loop run more total iterations while it was on the other logical core, than when it was context-switching with `wrmem`? If so, it could just be a throughput tradeoff. – Peter Cordes Jul 07 '22 at 02:10
  • Thanks, dear @PeterCordes. I updated the question with `perf` output for the busy loop app. The busy loop app runs slower on the same logical core, so apparently it's a tradeoff. But what causes this tradeoff? Anything specific, or a combination of events? Is there a deterministic model that can guide how to place applications so that the desired one gets better performance? For example, I need better performance on `wrmem` and I don't care about the busy loop app. Are there any tips on how to schedule them on hyper-threads so that `wrmem` runs faster? – Mohammad Siavashi Jul 07 '22 at 08:54
  • 1
    I expect most real workloads wouldn't be as disastrous for it to share a core with. Generally things that suffer a lot are ones sensitive to cache size and memory bandwidth. e.g. linear algebra (especially matmul) is well known to often scale negatively when parallelizing it to share a core with itself, because a well-tuned implementation can already saturate the FP execution units with a single thread per core, and competing for L1 and L2 cache footprint with another copy of itself makes things slower, costing overall throughput as well as per-thread throughput. – Peter Cordes Jul 07 '22 at 09:03
  • Then it seems there is no way to dynamically determine this. – Mohammad Siavashi Jul 07 '22 at 09:12
  • BTW, it is interesting that the CPU cycles decreased but the execution time increased. @PeterCordes – Mohammad Siavashi Jul 07 '22 at 09:14
  • 1
    If you're on a Skylake or other Intel with hardware P-state management, that might be its dynamic down-clocking on mostly memory-bound workloads, with energy-performance-preference settings other than `performance`. You forgot to record the `task-clock` software event so perf isn't calculating CPU frequency for you (cycles/CPU-second). Anyway, see [Slowing down CPU Frequency by imposing memory stress](https://stackoverflow.com/q/63399456). Interestingly, running a `_mm_pause()` loop on a separate physical core can help, because "client" CPUs share one clock for all unhalted cores. – Peter Cordes Jul 07 '22 at 09:21
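
For illustration, the `_mm_pause()` spinner mentioned in the last comment could look roughly like this (an editorial sketch, not from the thread; the default core number is an arbitrary placeholder):

```c
/* Sketch: pin a _mm_pause() spinner to an otherwise-idle physical core,
 * keeping that core unhalted so the shared clock domain on "client" CPUs
 * stays at a high frequency. Compile with: gcc -O2 -o spinner spinner.c */
#define _GNU_SOURCE
#include <sched.h>
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int core = argc > 1 ? atoi(argv[1]) : 3;  /* placeholder default core */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    for (;;)
        _mm_pause();  /* unhalted, but low-power and gentle on a sibling */
}
```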
