
I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.

My first pass was:

vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
   for(j = 0; j < 500; j++){
      count = count + setList[i].count(val[j]);
   }
}

Then I thought I could maybe get a speedup by hoisting the setList[i] subexpression up one level of nesting and saving it in a temporary variable, like so:

#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
   currSet = setList[i];
   for(j = 0; j < 500; j++){
      count = count + currSet.count(val[j]);
   }
}

I had thought this would maybe save a load on each iteration of the "j" loop and get a speedup, but it actually SLOWED DOWN by about 3x. By this I mean the entire kernel took about three times as long to run. Thoughts on why this would occur?

Thanks!

newb47

2 Answers


Adding up a few integers is really not enough work to warrant starting threads for.

If you forget to add the reduction clause, you'll suffer from true sharing: all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing the variable, which will considerably hurt your performance.

I just noticed that you set the schedule to dynamic. You shouldn't: the iterations here are regular enough to be divided up front, so don't specify a schedule and let the default static schedule apply.
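To make that concrete, here is a minimal, self-contained sketch of your first loop with a reduction clause and no schedule clause. The data setup (setList, val, their sizes and contents) is invented placeholder data just so the snippet compiles and runs on its own; only the pragma line is the point.

// Sketch: the membership-count loop with reduction(+:count) and the
// default (static) schedule. Placeholder data, not the OP's real input.
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    std::unordered_set<std::uint16_t> proto = {1, 2, 3};
    std::vector<std::unordered_set<std::uint16_t>> setList(1000, proto);

    std::vector<std::uint16_t> val(500);
    for (int j = 0; j < 500; j++) val[j] = static_cast<std::uint16_t>(j);

    const int size = static_cast<int>(setList.size());
    long count = 0;

    // reduction(+:count) gives each thread its own private copy of count and
    // adds the copies together at the end, so the threads never contend for
    // the cache line holding a single shared counter.
    #pragma omp parallel for default(shared) reduction(+:count)
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < 500; j++) {
            count += setList[i].count(val[j]);
        }
    }

    printf("count = %ld\n", count);
    return 0;
}

Built with something like g++ -O2 -fopenmp, each thread accumulates into its own private copy of count, and the copies are summed once at the end. Whether it actually beats the serial loop still depends on there being enough work per iteration, as noted above.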

Klaas van Gend

As has already been stated, inter-loop dependencies, i.e. threads waiting for data from other threads, or data being accessed by multiple threads in succession, can slow a parallelized program down and should be avoided as a rule of thumb. Built-in clauses such as reduction collect the per-thread results and combine them in an optimised fashion.

Here is a good example of a reduction being used in a case similar to yours, from the University of Utah:

int array[8] = { 1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
  sum += array[i];
}
printf("total %d\n", sum);

source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf

As an aside: variables only need to be listed in a private clause when they are declared outside the parallel region; variables declared inside it are private automatically, and the loop iteration counter of an OpenMP loop construct is private by default. So in both of your versions it is not strictly necessary to declare i private (a short sketch illustrating this follows the quoted clauses below).

see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses

Data sharing attribute clauses

  • shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.

  • private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.

see stack exchange answer here: OpenMP: are local variables automatically private?
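
To illustrate that aside, here is a small variation of the Utah example above (my own sketch, not taken from the lecture): the loop counter and the temporary are declared inside the loop, so they are private automatically and no private clause is needed.

// Sketch: same reduction as the Utah example, but i and the temporary are
// declared inside the loop, so they are private to each thread automatically.
// The counter of an omp for construct would be private by default anyway.
#include <cstdio>

int main() {
    int array[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    int sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 8; i++) {   // i: declared in the loop header, private
        int element = array[i];     // element: declared inside the region, private
        sum += element;
    }

    printf("total %d\n", sum);
    return 0;
}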

Walter Simson