C++ OpenMP: Writing to a matrix inside of for loop slows down the for loop significantly

Question

I have the following code. The bitCount function counts the number of set bits in a 64-bit integer. The test function is a reduced version of something I do in a more complicated piece of code; it reproduces how writing to a matrix significantly slows down the for loop. I am trying to figure out why this happens and whether there is a way around it.

#include <vector>
#include <cmath>
#include <cstdint> // uint64_t
#include <omp.h>

// Count the number of bits
inline int bitCount(uint64_t n){

  int count = 0;

  while(n){

    n &= (n-1);
    count++;

  }

  return count;

}


void test(){

  int nthreads = omp_get_max_threads();
  omp_set_dynamic(0);
  omp_set_num_threads(nthreads);

  // I need a priority queue per thread
  std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
  std::vector<uint64_t> vals(100,1);

  # pragma omp parallel for shared(mat,vals)
  for(int i = 0; i < 100000000; i++){
    std::vector<double> &tid_vec = mat[omp_get_thread_num()];
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
      tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
    }
  }

}

This code runs in about 11 seconds. If I comment out the following line:

tid_vec[j] = total_count;

the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?

Depending on your compiler and options, the inner loop reduction could be simd vectorized when you remove the serializing store. — tim18, Feb 24 '17 at 00:29
It is also true that without the store then the for loop doesn't do anything. Maybe it is optimized out? — Robert Prévost, Feb 24 '17 at 00:43
If you want a specific answer instead of just guesses you must provide details on the compiler version, option, hardware and a [mcve]. Also note that `bitcount` is widely known as `popcnt` and has been optimized to oblivion. — Zulan, Feb 24 '17 at 07:35
OT, but... 1) Many X86 processors now have a popcnt instruction, 2) If you don't want to use that, at least use the classic, popcnt that needs a constant six stages to handle 64b, rather than the up-to 64 you have. E.g. http://stackoverflow.com/questions/19729466/how-to-find-number-of-1s-in-a-binary-number-in-o1-time — Jim Cownie, Feb 24 '17 at 09:31

Jorge Bellon · Accepted Answer · 2017-02-24T11:09:04.847

Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.

If you comment out the line:

tid_vec[j] = total_count;

The compiler will optimize away all the computations whose result is not used. Therefore:

  total_count += bitCount(vals[j]);

is optimized away too. If your application's main kernel is never executed, it makes sense that the program runs much faster.
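
To time the bit counting without the store, the result has to stay observable so the compiler cannot discard it. A minimal sketch, assuming an OpenMP reduction is acceptable for your use case (countAllBits and grand_total are hypothetical names, not part of the original code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same bit count as in the question: clears one set bit per iteration.
inline int bitCount(std::uint64_t n) {
  int count = 0;
  while (n) {
    n &= (n - 1);  // clear the lowest set bit
    count++;
  }
  return count;
}

// Hypothetical helper: accumulates every per-iteration total into a
// reduction variable, so the computation feeds a value the caller uses.
long long countAllBits(const std::vector<std::uint64_t>& vals, int iterations) {
  long long grand_total = 0;
  #pragma omp parallel for reduction(+:grand_total)
  for (int i = 0; i < iterations; i++) {
    int total_count = 0;
    for (std::size_t j = 0; j < vals.size(); j++) {
      total_count += bitCount(vals[j]);
    }
    grand_total += total_count;
  }
  return grand_total;
}
```

Print or otherwise use the returned value in the caller. Since vals never changes, the compiler may still simplify part of this work, so treat the timing as a rough comparison only.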

On the other hand, I would not implement a bit count function myself but rather rely on functionality that is already provided. For example, GCC's builtin functions include __builtin_popcount, which does exactly what you are trying to do. Note that __builtin_popcount takes an unsigned int; for a 64-bit argument, use the __builtin_popcountll variant.
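
A sketch of what that replacement could look like, assuming GCC or Clang (bitCountBuiltin is a hypothetical name). Because the question counts bits in a uint64_t, this uses the unsigned long long variant of the builtin rather than the unsigned int one:

```cpp
#include <cstdint>

// Hypothetical drop-in replacement for bitCount from the question.
// Assumes GCC or Clang; __builtin_popcountll takes an unsigned long long,
// which covers the full 64 bits of the argument.
inline int bitCountBuiltin(std::uint64_t n) {
  return __builtin_popcountll(n);
}
```

When the target supports it (for example with -mpopcnt or -march=native on x86), the compiler typically lowers this to a single popcnt instruction.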

As a bonus: it is much better to work on private data than on a shared array in which each thread uses different elements. It improves locality (especially important when memory access is not uniform, i.e. NUMA) and may reduce access contention.

#pragma omp parallel shared(mat,vals)
{
  // Each thread works on its own private vector
  std::vector<double> local_vec(1000,-INFINITY);
  #pragma omp for
  for(int i = 0; i < 100000000; i++) {
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
      local_vec[j] = total_count;
    }
  }
  // Copy local_vec to mat[omp_get_thread_num()]
}
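
For completeness, here is one way the copy-back step could be written out. This is a sketch under assumptions, not the only way to do it: prefixCountsPerThread is a hypothetical wrapper, the iteration count is a parameter, vals is assumed to hold at most 1000 entries, and the _OPENMP guards only exist so the code still builds without -fopenmp.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Same bit count as in the question.
inline int bitCount(std::uint64_t n) {
  int count = 0;
  while (n) {
    n &= (n - 1);  // clear the lowest set bit
    count++;
  }
  return count;
}

// Hypothetical wrapper: returns one row per thread, as mat in the question.
// Assumes vals.size() <= 1000, the row length used in the question.
std::vector<std::vector<double>> prefixCountsPerThread(
    const std::vector<std::uint64_t>& vals, int iterations) {
#ifdef _OPENMP
  const int nthreads = omp_get_max_threads();
#else
  const int nthreads = 1;  // serial fallback when built without OpenMP
#endif
  std::vector<std::vector<double>> mat(
      nthreads, std::vector<double>(1000, -INFINITY));

  #pragma omp parallel shared(mat, vals)
  {
    // Private working buffer: the hot loop never touches shared memory.
    std::vector<double> local_vec(1000, -INFINITY);
    #pragma omp for
    for (int i = 0; i < iterations; i++) {
      int total_count = 0;
      for (std::size_t j = 0; j < vals.size(); j++) {
        total_count += bitCount(vals[j]);
        local_vec[j] = total_count;
      }
    }
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;
#endif
    // Each thread writes only its own row, once, after the loop.
    mat[tid] = local_vec;
  }
  return mat;
}
```

A thread that receives no iterations leaves its row filled with -INFINITY, which matches the initial state of mat in the question.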
