
When I run the same algorithm with and without dynamic scheduling, the performance difference is dramatic: with scheduling the algorithm finishes in 4 seconds, and without it in 14 seconds. I thought perf would provide some insight as to why this might be occurring, but the stats are very similar.

Is it safe to assume that by using dynamic scheduling I have addressed some load-balancing issue? I was hoping to find something in the perf detail. Below are the details in case they are helpful, along with the code used for PageRank where scheduling is applied...

    // One PageRank pass over all vertices. damp (the damping factor),
    // adj (the teleport/adjustment term) and conntrib[] (each vertex's
    // contribution) are computed elsewhere.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int u = 0; u < vertex_count; u++) {
        int* in_edge = G.getAdjList(u);   // incoming edges of u
        double sum = 0.0;
        for (int j = 0; j < in_edge_counts[u]; j++) {
            int v = in_edge[j];
            sum += conntrib[v];           // scattered reads across conntrib[]
        }
        pr_temp[u] = sum * damp + adj;
    }
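
The version without scheduling is identical apart from the `schedule` clause, so OpenMP's default (typically static) applies:

    #pragma omp parallel for
    for (int u = 0; u < vertex_count; u++) {
        // ... same loop body as above ...
    }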

With dynamic scheduling:

     107470.977295      task-clock (msec)         #    1.743 CPUs utilized
             1,187      context-switches          #    0.011 K/sec
                44      cpu-migrations            #    0.000 K/sec
         2,279,522      page-faults               #    0.021 M/sec
   255,920,277,205      cycles                    #    2.381 GHz                      (20.00%)
    17,116,048,117      stalled-cycles-frontend   #    6.69% frontend cycles idle     (20.02%)
   153,944,352,418      stalled-cycles-backend    #   60.15% backend cycles idle      (20.02%)
   148,412,677,859      instructions              #    0.58  insn per cycle
                                                  #    1.04  stalled cycles per insn  (30.01%)
    27,479,936,585      branches                  #  255.696 M/sec                    (40.01%)
       321,470,463      branch-misses             #    1.17% of all branches          (50.01%)
    78,562,370,506      L1-dcache-loads           #  731.010 M/sec                    (50.00%)
     2,075,635,902      L1-dcache-load-misses     #    2.64% of all L1-dcache hits    (49.99%)
     3,100,740,665      LLC-loads                 #   28.852 M/sec                    (50.00%)
       964,981,918      LLC-load-misses           #   31.12% of all LL-cache hits     (50.00%)

Without dynamic scheduling:

      106872.881349      task-clock (msec)         #    1.421 CPUs utilized
             1,237      context-switches          #    0.012 K/sec
                69      cpu-migrations            #    0.001 K/sec
         2,262,865      page-faults               #    0.021 M/sec
   254,236,425,448      cycles                    #    2.379 GHz                      (20.01%)
    14,384,218,171      stalled-cycles-frontend   #    5.66% frontend cycles idle     (20.04%)
   163,855,226,466      stalled-cycles-backend    #   64.45% backend cycles idle      (20.03%)
   149,318,162,762      instructions              #    0.59  insn per cycle
                                                  #    1.10  stalled cycles per insn  (30.03%)
    27,627,422,078      branches                  #  258.507 M/sec                    (40.03%)
       213,805,935      branch-misses             #    0.77% of all branches          (50.03%)
    78,495,942,802      L1-dcache-loads           #  734.480 M/sec                    (50.00%)
     2,089,837,393      L1-dcache-load-misses     #    2.66% of all L1-dcache hits    (49.99%)
     3,166,900,999      LLC-loads                 #   29.632 M/sec                    (49.98%)
       929,170,535      LLC-load-misses           #   29.34% of all LL-cache hits     (49.98%)
pad11
  • These are different versions of the *same* algorithm? Can you show the actual algorithm? I notice they both use nearly identical amounts of CPU time and cycles (and other counters), so apparently only the wall-clock time is different. But 1.4 vs. 1.7 CPUs utilized doesn't explain 4 vs. 14 seconds (you did leave out the total time line of the perf output). What hardware did you test on, and was the machine idle at the time? Hyperthreading? – Peter Cordes Apr 23 '18 at 07:00
  • It was running on a compute server with four AMD Opterons. The machine was mostly idle at the time. The only difference between the two is the use of scheduling. I left out the time because it includes a pre-processing step that both versions incur, but it still differs by 10 seconds. – pad11 Apr 23 '18 at 12:13
  • I've added the code as well. It's PageRank. – pad11 Apr 23 '18 at 12:21
  • Oh, you're the same person that asked [this previous question](https://stackoverflow.com/questions/49907288/understanding-perf-detail-when-comparing-two-different-implementations-of-a-bfs) where these perf outputs aren't exactly what you described in the text. So 1.4 vs. 1.7 CPUs utilized overall could be explained by much more utilization during the 4 / 14 sec you're talking about. You could approximate the results for the interesting part by making your program exit right before the parallel part, and profiling that. – Peter Cordes Apr 23 '18 at 12:22
  • Same but entirely different code. I'll see – pad11 Apr 23 '18 at 12:23
  • Does `schedule(dynamic, 64)` tell OpenMP not to assume that each inner-loop iteration takes the same time? So static scheduling breaks the work into ranges of `u` values, but one thread ends up with more total work than the rest, delaying the total time? If that's possible, and I'm half-remembering / half-guessing what dynamic scheduling is, then that would totally explain it. – Peter Cordes Apr 23 '18 at 12:25
  • Yeah, exactly. I guess I don't see why there would be enough load imbalance that the timings would be so different. The only thing I can think of is that some threads end up with far more vertices to check neighbors for than others but there shouldn't be big variance. Or so I thought anyway – pad11 Apr 23 '18 at 12:39
  • It's your code + data so IDK; `G.getAdjList(u)` taking more time, or more likely `in_edge_counts[u]` being larger on average for some threads, could make a difference. (You don't use `incoming` inside the loop, though...) Or differences in locality for `conntrib[in_edge[j]]` could make a lot of difference, causing cache misses. Perhaps some coarse sorting for chunks of `in_edge[]` would be useful to create a better access pattern, especially for a part that's common to all `u`. – Peter Cordes Apr 23 '18 at 12:47
  • When you have different cores competing for memory + last-level cache bandwidth, a delay in servicing a request for a cache line of `in_edge` values would be worse than a delayed response for a line of `conntrib[]`, because the CPU needs `in_edge[]` data before it can queue up loads for more `conntrib` data. (Too bad AMD doesn't have efficient AVX2 gathers, because your data is perfect for `vpgatherdd ymm`. Not that it would help with fairness.) – Peter Cordes Apr 23 '18 at 12:52
  • Sorry, that was a typo / fixed. in_edges is the same. I'll give this some thought. Thank you! – pad11 Apr 23 '18 at 13:05
  • Yeah, I thought that was probably an editing error and each different `u` was really using a different `in_edge[]`. Otherwise you could just prefix-sum it (array of running totals) and look up a total for each `u` with no inner loop. – Peter Cordes Apr 23 '18 at 13:07
  • I'll think about that as well. Efficiency is not what I'm ultimately after, though; it's more about understanding why there are differences. I did change my code to handle the calculations much differently, which had a great impact as well. I'm assuming what you've mentioned could optimize it further. – pad11 Apr 23 '18 at 13:09

2 Answers


`perf stat` is not the right tool to understand the performance impact of dynamic scheduling in multi-threaded programs. For that, you need a tool that allows you to distinguish threads and time during the analysis.

`perf timechart` may be worth a shot, but it is not aware of OpenMP specifically. You will get the best insight by using an analysis tool that is made specifically for OpenMP. Furthermore, to capture the dynamics, I recommend using a tracing/timeline tool as opposed to a profiling tool that only shows you a summary. Examples of such tools are Intel VTune, or Score-P (tracing) with Vampir (visualization).
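
If none of those tools is available, a crude alternative is to time each thread's share of the parallel region yourself. A minimal sketch, reusing the names from the question (the adjacency lists are passed as a plain `adj` array standing in for `G.getAdjList(u)`, and the teleport term is renamed `adj_term` here to avoid a name clash):

    // Sketch: per-thread timing of the PageRank loop. The nowait clause
    // stops the clock before the end-of-loop barrier, so each thread
    // reports its own work time rather than time spent waiting for
    // slower threads -- a large spread indicates load imbalance.
    #include <omp.h>
    #include <cstdio>

    void timed_pass(int vertex_count, const int *in_edge_counts,
                    int *const *adj, const double *conntrib,
                    double *pr_temp, double damp, double adj_term)
    {
        #pragma omp parallel
        {
            double t0 = omp_get_wtime();
            #pragma omp for schedule(dynamic, 64) nowait
            for (int u = 0; u < vertex_count; u++) {
                double sum = 0.0;
                for (int j = 0; j < in_edge_counts[u]; j++)
                    sum += conntrib[adj[u][j]];
                pr_temp[u] = sum * damp + adj_term;
            }
            std::printf("thread %d: %.3f s\n", omp_get_thread_num(),
                        omp_get_wtime() - t0);
        }
    }

Run it once with `schedule(dynamic, 64)` and once with the clause removed; if the per-thread times diverge only in the static case, load imbalance is confirmed.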

Zulan
  • Or Intel VTune (which is now available gratis to everyone, if you can stand having to ask for a new license every so often: https://software.intel.com/en-us/parallel-studio-xe/choose-download). VTune has specific OpenMP performance profiling which shows you load imbalance. (FWIW I work for Intel :-)) – Jim Cownie Apr 23 '18 at 09:02
  • Ah, I actually wanted to use Intel VTune but was unable to install it. I'm sort of restricted in what I can use - Valgrind, perf and OProfile are really the only options. Not too clear on how to use timechart, but I'll give it a look... thanks! – pad11 Apr 23 '18 at 12:26

`schedule(dynamic, 64)` tells OpenMP not to assume that each iteration of the parallelized loop takes the same time, IIRC.

So static scheduling breaks the work into ranges of `u` values, but one of your threads must be ending up with more total work than the rest (or taking longer for some reason), delaying the total time for all threads to finish.
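
Roughly speaking, the default static schedule hands each thread one contiguous block of iterations up front. A sketch of the manual equivalent (`t` and `nthreads` here are hypothetical stand-ins for the thread id and thread count):

    // Approximately what schedule(static) does: thread t gets one
    // contiguous block of u values, regardless of how many in-edges
    // (and therefore how much work) each u in its block carries.
    int chunk = (vertex_count + nthreads - 1) / nthreads;
    int lo = t * chunk;
    int hi = lo + chunk < vertex_count ? lo + chunk : vertex_count;
    for (int u = lo; u < hi; u++) {
        // ... loop body from the question ...
    }

With `schedule(dynamic, 64)`, threads instead grab 64-iteration chunks from a shared queue as they finish, so a thread that lands on expensive vertices simply ends up taking fewer chunks.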


> It was running on a compute server with four AMD Opterons. The machine was mostly idle at the time. The only difference between the two is the use of scheduling. I left out the time because it includes a pre-processing step that both versions incur, but it still differs by 10 seconds.

In that case, the 1.4 vs. 1.7 CPUs utilized overall could be explained by much more utilization during the 4 / 14 sec you're talking about. You could approximate the results for the interesting part by making your program exit right before the parallel part, and profiling that. Subtract those counts from your total counts to get a very rough approximation.
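
A sketch of that trick; the environment variable is purely hypothetical, just one way to toggle the early exit:

    // Run once with PROFILE_PRE_ONLY=1 under perf stat to count only
    // the pre-processing, then subtract those counts from a full run
    // to approximate the parallel part alone.
    #include <cstdlib>

    int main()
    {
        // ... pre-processing common to both versions ...
        if (std::getenv("PROFILE_PRE_ONLY"))
            return 0;
        // ... parallel PageRank loop ...
    }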


IDK why the work is unbalanced; it's your code + data. `G.getAdjList(u)` could be taking more time, or more likely `in_edge_counts[u]` is larger on average for some threads.
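
One cheap way to check is to sum `in_edge_counts` over each static-sized block of `u` values up front (a sketch; `nthreads` is assumed to be the thread count):

    // Diagnostic sketch: total in-edge work per static block. If the
    // max/min ratio across blocks is large, static scheduling alone
    // would explain the wall-clock gap.
    #include <vector>

    std::vector<long long> work(nthreads, 0);
    int chunk = (vertex_count + nthreads - 1) / nthreads;
    for (int u = 0; u < vertex_count; u++)
        work[u / chunk] += in_edge_counts[u];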


Differences in locality for `conntrib[in_edge[j]]` could make a lot of difference, causing cache misses for scattered reads, or cache hits when different `in_edge` elements are close to previously loaded values.
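
If that's the problem, one mitigation worth trying is sorting each adjacency list once up front (a sketch, assuming the graph doesn't change between iterations), so the `conntrib[v]` reads within a list become an ascending sweep that touches fewer distinct cache lines:

    // One-time pass before the PageRank iterations: sorting each list
    // turns random jumps through conntrib[] into an ordered scan.
    #include <algorithm>

    for (int u = 0; u < vertex_count; u++) {
        int *in_edge = G.getAdjList(u);
        std::sort(in_edge, in_edge + in_edge_counts[u]);
    }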

When you have different cores competing for memory + last-level cache bandwidth, a delay in servicing a request for a cache line of `in_edge` values would be worse than a delayed response for a line of `conntrib[]`, because the CPU needs the `in_edge[]` data before it knows which cache lines it needs for more `conntrib` data. So `in_edge` cache misses reduce memory parallelism, maybe making that thread get less of a share of memory bandwidth.
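
One way to keep more `conntrib` requests in flight is software prefetching a few iterations ahead (a sketch using GCC/Clang's `__builtin_prefetch`; the distance of 16 is a guess that would need tuning):

    // in_edge[] itself is read sequentially, so the hardware prefetcher
    // usually covers it; the scattered conntrib[] lines are the ones
    // worth prefetching ahead of use.
    const int PF_DIST = 16;   // tuning parameter, purely an assumption
    for (int j = 0; j < in_edge_counts[u]; j++) {
        if (j + PF_DIST < in_edge_counts[u])
            __builtin_prefetch(&conntrib[in_edge[j + PF_DIST]]);
        sum += conntrib[in_edge[j]];
    }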

IDK how much these effects would balance out or not. More likely your data is just not evenly distributed.


Too bad AMD doesn't have efficient AVX2 gathers, because your data is perfect for `vpgatherdd ymm`. Not that it would help with fairness or anything; it would just (on Skylake) give a probably-minor speedup vs. scalar gathers.

Peter Cordes