0

I am trying to learn how to use OpenMP locks, for which I am testing it on a program that generates a histogram, given some data.

I generate some random numbers with the following:

unsigned long seed = 9876543210 ;

std::mt19937_64 gen {seed} ;

float mu    = 5.0 ;
float sigma = 1.0 ;

std::normal_distribution<float> dist {mu, sigma} ;

int N = 5000 ;

float array[N] ;

for( int i=0; i<N; ++i ) array[i] = dist( gen ) ; // store random numbers

To generate histogram (within a range [min, max]) with OpenMP locks, I do the following:

int bins = 5  ;
int hist[bins] ;
int ival ;

for( ival=0; ival<bins; ++ival ) hist[ival] = 0 ;

omp_lock_t locks[bins] ;

for( ival=0; ival<bins; ++ival ) omp_init_lock(&locks[ival]) ;

float min   = mu - 3*sigma       ;
float max   = mu + 3*sigma       ;
float scale = (max - min) / bins ;

int i ;

#pragma omp parallel for num_threads( 4 )
for( i=0; i<N; ++i ) {

ival = (int) floorf( (array[i] - min) / scale ) ; // bin index

if( ival < 0 || ival >= bins ) continue ;

omp_set_lock(&locks[ival]) ;
hist[ival]++ ;
omp_unset_lock(&locks[ival]) ;

}

for( ival=0; ival<bins; ++ival ) omp_destroy_lock(&locks[ival]) ;

This program takes exceedingly long to run, so I had to quit it before it could finish. The serial version takes an instant and runs just fine.

What am I doing wrong here?

The compilation is done with g++ using the flags:

-std=c++11 -O2 -fopenmp

Thanks in advance.

XCSM
  • 99
  • 6
  • 1
    Use reduction instead of locks: `#pragma omp parallel for num_threads( 4 ) reduction(+:hist[:N]) private(ival)` – Laci Feb 11 '22 at 12:56
  • If you have many bins and not a lot of RAM/thread, atomics might also work fine. – paleonix Feb 11 '22 at 14:26
  • 1
    @Laci Reduction worked with a small correction: ```reduction(+:hist[:bins])``` – XCSM Feb 11 '22 at 14:27
  • Note also that you have a race condition using shared variable `ival`, so you have to set it `private`, or even better declare it inside the parallel region. It also may be one of the reasons of very long runtimes as it can cause 'temporary' deadlocks. – Laci Feb 11 '22 at 14:53

1 Answers1

4

Your use of locks is technically correct in the sense that they do what you intend them to do. However, locks are extremely slow when used in this manner. Highly contested locks like these are always slow. Even when not contested, they require at least one atomic instruction to lock and one to unlock. Those run at a throughput of 1 every 20 cycles or so. Your normal histogram may run at 1 per cycle or one every few cycles (due to load-store-forwarding).

The solution is to use one partial histogram per thread. Then accumulate all histograms at the end. This is described in many answers on SO. See for example

Homer512
  • 9,144
  • 2
  • 8
  • 25