
I have the following code:

for (int i = 0; i < veryLargeArraySize; i++){
    int value = A[i];
    if (B[value] < MAX_VALUE) {
        B[value]++;
    }
}

I want to use an OpenMP worksharing construct here, but my issue is the synchronization on the B array: all parallel threads can access any element of B, which is very large, so per-element locks are impractical (I'd need far too many of them).

#pragma omp critical introduces serious overhead here. Atomic is not possible because of the if.

Does anyone have a good suggestion on how I might do this?

Petros17
  • How large is `B` compared to `A`? – Zulan Nov 01 '16 at 16:30
  • @Zulan They both exceed 1 million elements, with A being slightly larger (it is a histogram calculation of an image - A is the pixelated image, B is the histogram) – Petros17 Nov 02 '16 at 09:42
  • I would assume that a histogram has *significantly* less bins than the number of original datapoints. Is the histogram very sparse? – Zulan Nov 02 '16 at 13:00
  • Why truncate the count (B) inside the loop? Why not simply unconditionally add one, and then, in a separate phase loop over B and truncate the values. Then you can use an atomic. – Jim Cownie Nov 03 '16 at 10:47
  • Possible duplicate of [Calculate the histogram with OpenMP](http://stackoverflow.com/questions/21777819/calculate-the-histogram-with-openmp) – Zulan Nov 03 '16 at 15:13
  • @JimCownie the array is of unsigned char, and I'm not able to change the type (I'm supposed to parallelize the given code). – Petros17 Nov 03 '16 at 18:59
  • @Zulan there is no way to know if the histogram will or will not be sparse – Petros17 Nov 03 '16 at 18:59
  • @Zulan I'll try with the private arrays, but I've already tried multiple similar solutions, and the sequential code was still much faster.. – Petros17 Nov 03 '16 at 19:00
  • OK, you could hash value to generate an index into an array of locks (the hash value might be as simple as (value & (NUMLOCKS-1)) if you choose NUMLOCKS as a power of two). That way you can balance the amount of space used against the amount of contention. (It's unpleasant to have to use locks explicitly, but I don't see a way to do this with critical sections, since you can't specify anything other than a name). If you're using OpenMP 4.5, you might also want to use a hinted lock and tell the runtime it's likely uncontended. – Jim Cownie Nov 04 '16 at 09:15
  • @JimCownie I've tried something similar, but still need to get the right amount of locks. I'll try it, thank you, and I'll post the results – Petros17 Nov 04 '16 at 15:23
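The lock-striping idea from the comments can be sketched roughly as follows. This is a minimal, hypothetical sketch: `NUM_LOCKS`, `NUM_BINS`, and `MAX_COUNT` are made-up values, and the `#ifdef _OPENMP` stubs just let it also build without OpenMP.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Stubs so the sketch also compiles without -fopenmp */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { (void)l; }
static void omp_set_lock(omp_lock_t *l)     { (void)l; }
static void omp_unset_lock(omp_lock_t *l)   { (void)l; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

#define NUM_LOCKS 256   /* power of two, so the hash is a cheap mask */
#define NUM_BINS  1024  /* hypothetical histogram size */
#define MAX_COUNT 255   /* hypothetical saturation value */

static omp_lock_t locks[NUM_LOCKS];
static int histo[NUM_BINS];

void build_histogram(const int *A, long n) {
    for (int k = 0; k < NUM_LOCKS; k++) omp_init_lock(&locks[k]);

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        int value = A[i];
        /* Hash the bin index to one of NUM_LOCKS locks; several bins
           share a lock, trading contention against lock-array size. */
        omp_lock_t *lk = &locks[value & (NUM_LOCKS - 1)];
        omp_set_lock(lk);
        if (histo[value] < MAX_COUNT) histo[value]++;
        omp_unset_lock(lk);
    }

    for (int k = 0; k < NUM_LOCKS; k++) omp_destroy_lock(&locks[k]);
}
```

Choosing `NUM_LOCKS` lets you balance memory against contention, as suggested above; with OpenMP 4.5 the locks could additionally be initialized with `omp_init_lock_with_hint(..., omp_lock_hint_uncontended)`.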

1 Answer


Here's what I've found out and done.

I've read on some forums that parallel histogram calculation is generally a bad idea, since the synchronization overhead may make it slower and less efficient than the sequential calculation.

However, I needed to do it (for the assignment), so what I did is the following:

  1. Parallel processing of the A array (the image) to determine the actual range of values in the histogram (the B array): find the MIN and MAX of A[i]

    int min_value = INT_MAX, max_value = INT_MIN; // reduction variables must be initialized
    #pragma omp parallel for reduction(min:min_value) reduction(max:max_value)
    for (i = 0; i < veryLargeArraySize; i++){
         const int value = A[i];
         if(max_value < value) max_value = value;
         if(min_value > value) min_value = value;
    }
    
    int size_of_histo = max_value - min_value + 1;
    
    
  2. That way, we can (potentially) reduce the actual histogram size from, e.g., 1M elements (allocated in array B) to 50K elements (allocated in sharedHisto)
  3. Allocate a shared array, such as:

    int num_threads = omp_get_max_threads(); // omp_get_num_threads() returns 1 outside a parallel region
    int* sharedHisto = (int*) calloc(num_threads * size_of_histo, sizeof(int));
    
  4. Each thread is assigned a part of the sharedHisto, and can update it without synchronization

    #pragma omp parallel private(i)
    {
        int my_id = omp_get_thread_num(); // must be queried inside the parallel region
        #pragma omp for
        for(i = 0; i < veryLargeArraySize; i++){
            int value = A[i];
            // my_id * size_of_histo positions to the beginning of this
            // thread's part of sharedHisto;
            // value - min_value (not i - min_value) positions to the actual bin
            sharedHisto[my_id * size_of_histo + value - min_value]++;
        }
    }
    
  5. Now, perform a reduction (as stated here: Reducing on array in OpenMP)

    #pragma omp parallel private(i, j)
    {
       // Every thread is in charge of part of the reduced histogram,
       // which has size_of_histo bins; the worksharing construct splits
       // the bins among threads, so no manual chunking is needed.
       // num_threads is the same value used to allocate sharedHisto.
       #pragma omp for
       for(i = 0; i < size_of_histo; i++){
           int sum = B[i + min_value];
           for(j = 0; j < num_threads; j++){
               sum += sharedHisto[j * size_of_histo + i];
           }
           B[i + min_value] = (sum > MAX_VALUE) ? MAX_VALUE : sum;
       }
    }
    
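Putting the steps together, a self-contained sketch might look like the following. Assumptions not in the original: A is `unsigned char` (as mentioned in the comments), `MAX_VALUE` is 255, and the `#ifdef _OPENMP` stubs let it also compile without OpenMP.

```c
#include <stdlib.h>
#include <limits.h>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

#define MAX_VALUE 255  /* hypothetical saturation limit for the counts in B */

/* Build a saturating histogram of A (n pixels) into B (256 bins). */
void histogram(const unsigned char *A, long n, unsigned char *B) {
    /* Step 1: find the actual value range in parallel. */
    int min_value = INT_MAX, max_value = INT_MIN;
    #pragma omp parallel for reduction(min:min_value) reduction(max:max_value)
    for (long i = 0; i < n; i++) {
        const int value = A[i];
        if (value > max_value) max_value = value;
        if (value < min_value) min_value = value;
    }
    int size_of_histo = max_value - min_value + 1;

    /* Step 3: one private slice of bins per thread. */
    int num_threads = omp_get_max_threads();
    int *sharedHisto = calloc((size_t)num_threads * size_of_histo, sizeof(int));

    /* Step 4: each thread counts into its own slice, no synchronization. */
    #pragma omp parallel
    {
        int my_id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            sharedHisto[my_id * size_of_histo + A[i] - min_value]++;
    }

    /* Step 5: reduce the slices into B, clamping at MAX_VALUE. */
    #pragma omp parallel for
    for (int i = 0; i < size_of_histo; i++) {
        int sum = B[i + min_value];
        for (int j = 0; j < num_threads; j++)
            sum += sharedHisto[j * size_of_histo + i];
        B[i + min_value] = (sum > MAX_VALUE) ? MAX_VALUE : sum;
    }
    free(sharedHisto);
}
```

Extra slices allocated for threads that never run stay zero-filled, so reducing over `omp_get_max_threads()` slices is harmless even if the parallel region uses fewer threads.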
    
Petros17