When I run the same algorithm with and without an OpenMP schedule clause, the performance difference is dramatic: with scheduling it finishes in 4 seconds, and without it takes 14 seconds. I thought perf would provide some insight into why this happens, but the stats are very similar.
Is it safe to assume that by using dynamic scheduling I have addressed a load-balancing problem? I was hoping to find evidence of that in the perf output. The PageRank loop where scheduling is used comes first, followed by the perf detail for both runs in case it is helpful.
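    // Each iteration handles one vertex; its cost grows with the vertex's in-degree.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int u = 0; u < vertex_count; u++) {
        int* in_edge = G.getAdjList(u);
        double sum = 0.0;
        // Accumulate the contributions of all in-neighbours of u.
        for (int j = 0; j < in_edge_counts[u]; j++) {
            int v = in_edge[j];
            sum += conntrib[v];
        }
        pr_temp[u] = sum * damp + adj;
    }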
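One way to test the load-balancing guess directly, rather than reading it out of perf counters, might be to count how many in-edges each thread would receive if the vertex range were split into equal contiguous blocks. That is roughly what a static schedule does, although the default schedule is implementation-defined, so this is an assumption. The helper names `edges_per_static_block` and `imbalance_ratio` are hypothetical, and `in_edge_counts` is assumed here to be available as a `std::vector<int>`. This is a rough sketch, not code from the program above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: total in-edges per block if vertices [0, n) are split
// into `threads` contiguous, near-equal blocks (an approximation of a static
// schedule; the exact split used by an OpenMP runtime may differ).
std::vector<long long> edges_per_static_block(const std::vector<int>& in_edge_counts,
                                              int threads) {
    assert(threads > 0);
    const std::size_t n = in_edge_counts.size();
    const std::size_t t_count = static_cast<std::size_t>(threads);
    std::vector<long long> work(t_count, 0LL);
    for (std::size_t t = 0; t < t_count; ++t) {
        const std::size_t begin = n * t / t_count;
        const std::size_t end = n * (t + 1) / t_count;
        for (std::size_t u = begin; u < end; ++u) {
            work[t] += in_edge_counts[u];
        }
    }
    return work;
}

// Hypothetical helper: heaviest block divided by the mean block.
// 1.0 means perfectly balanced; larger values mean one thread does more work.
double imbalance_ratio(const std::vector<long long>& work) {
    assert(!work.empty());
    long long total = 0;
    long long worst = 0;
    for (long long w : work) {
        total += w;
        worst = std::max(worst, w);
    }
    if (total == 0) {
        return 1.0;  // no edges at all: treat as balanced
    }
    return static_cast<double>(worst) * static_cast<double>(work.size()) /
           static_cast<double>(total);
}
```

Calling `imbalance_ratio(edges_per_static_block(in_edge_counts, num_threads))` on the real graph would give a single number to look at. A ratio well above 1.0 would suggest that a few blocks hold most of the edges, which is the situation `schedule(dynamic, 64)` is meant to help with.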
perf stat output with dynamic scheduling:
107470.977295 task-clock (msec) # 1.743 CPUs utilized
1,187 context-switches # 0.011 K/sec
44 cpu-migrations # 0.000 K/sec
2,279,522 page-faults # 0.021 M/sec
255,920,277,205 cycles # 2.381 GHz (20.00%)
17,116,048,117 stalled-cycles-frontend # 6.69% frontend cycles idle (20.02%)
153,944,352,418 stalled-cycles-backend # 60.15% backend cycles idle (20.02%)
148,412,677,859 instructions # 0.58 insn per cycle
# 1.04 stalled cycles per insn (30.01%)
27,479,936,585 branches # 255.696 M/sec (40.01%)
321,470,463 branch-misses # 1.17% of all branches (50.01%)
78,562,370,506 L1-dcache-loads # 731.010 M/sec (50.00%)
2,075,635,902 L1-dcache-load-misses # 2.64% of all L1-dcache hits (49.99%)
3,100,740,665 LLC-loads # 28.852 M/sec (50.00%)
964,981,918 LLC-load-misses # 31.12% of all LL-cache hits (50.00%)
perf stat output without the schedule clause:
106872.881349 task-clock (msec) # 1.421 CPUs utilized
1,237 context-switches # 0.012 K/sec
69 cpu-migrations # 0.001 K/sec
2,262,865 page-faults # 0.021 M/sec
254,236,425,448 cycles # 2.379 GHz (20.01%)
14,384,218,171 stalled-cycles-frontend # 5.66% frontend cycles idle (20.04%)
163,855,226,466 stalled-cycles-backend # 64.45% backend cycles idle (20.03%)
149,318,162,762 instructions # 0.59 insn per cycle
# 1.10 stalled cycles per insn (30.03%)
27,627,422,078 branches # 258.507 M/sec (40.03%)
213,805,935 branch-misses # 0.77% of all branches (50.03%)
78,495,942,802 L1-dcache-loads # 734.480 M/sec (50.00%)
2,089,837,393 L1-dcache-load-misses # 2.66% of all L1-dcache hits (49.99%)
3,166,900,999 LLC-loads # 29.632 M/sec (49.98%)
929,170,535 LLC-load-misses # 29.34% of all LL-cache hits (49.98%)