
The following function reads data from a file in a loop and processes one loaded chunk at a time. To speed this up, I tried using OpenMP on the inner for loop so that the work is divided among the threads, as follows:

void read_process(FILE *fp_read, double *centroids, int total) {

    int i, j, c, dim = 16, chunk_size = 10000, num_itr;
    double *buffer = calloc(total * dim, sizeof(double));
    num_itr = total / chunk_size;
    for (c = 0; c < total; ++c) {
        fread(buffer, sizeof(double), chunk_size * dim, fp_read);
#pragma omp parallel private(i, j)
        {
            #pragma omp for
            for (i = 0; i < chunk_size; i++) {
                for (j = 0; j < dim; j++) {
                    #pragma omp atomic update
                    centroids[j] += buffer[i * dim + j];
                }
            }
        }
    }

    free(buffer);
    fclose(fp_read);
}

Without OpenMP, my code works fine. However, adding the #pragma directives causes the program to stop and print the word Hangup in the terminal, with no further explanation of what caused it. Answers to other Stack Overflow questions about this error message suggest that it is probably due to a race condition, but I don't think that is the case here because I am using atomic, which serializes access to the buffer. Am I right? Do you see an issue with my code? How can I improve it?

Thank you very much.

Elarbi Mohamed Aymen
steve
  • If you have a compiler with OpenMP 4.5 support I think you can do `#pragma omp parallel for private(i, j) reduction(+:centroids[:16])` and remove the atomic pragma. – Z boson Apr 18 '18 at 06:56

1 Answer


What you want to do is an array reduction. If you have a compiler that supports OpenMP 4.5 then you don't need to change your serial code. You can do

#pragma omp parallel for private(j) reduction(+:centroids[:dim])
for(i=0; i < chunk_size; i++) {
  for(j=0; j < dim; j++) {
    centroids[j] += buffer[i*dim +j];
  }
}
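For reference, here is a minimal self-contained sketch of that loop wrapped in a function. The name `sum_rows` and its parameters are hypothetical and not part of the question. It assumes a compiler with OpenMP 4.5 array-section reductions (for example a recent GCC with `-fopenmp`). If OpenMP is not enabled, the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical helper: adds every row of `buffer` (n_rows x dim, row-major)
 * into `centroids` (length dim). */
void sum_rows(const double *buffer, int n_rows, int dim, double *centroids)
{
    int i, j;

    /* Each thread works on a zero-initialised private copy of
     * centroids[0..dim); OpenMP adds the copies back into the original
     * array when the loop ends, so no atomic is needed. */
    #pragma omp parallel for private(j) reduction(+:centroids[:dim])
    for (i = 0; i < n_rows; i++) {
        for (j = 0; j < dim; j++) {
            centroids[j] += buffer[(size_t)i * dim + j];
        }
    }
}
```

In the question's function this would be called once per chunk, right after the `fread`, with `n_rows` set to the number of rows actually read.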

Otherwise you can do the array reduction by hand. Here is one solution

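#pragma omp parallel private(i, j)
{
  // per-thread partial sums; a variable-length array cannot take an
  // initializer, so zero it explicitly
  double tmp[dim];
  for(j=0; j < dim; j++) tmp[j] = 0;
  #pragma omp for
  for(i=0; i < chunk_size; i++) {
    for(j=0; j < dim; j++) {
      tmp[j] += buffer[i*dim +j];
    }
  }
  // merge the partial sums into the shared array, one thread at a time
  #pragma omp critical
  for(j=0; j < dim; j++) centroids[j] += tmp[j];
}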

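Your current solution causes heavy contention because every thread atomically updates the same dim elements of centroids, so the cache lines holding them bounce between cores on every update. Both of the solutions above fix this problem by giving each thread a private copy of centroids and merging the copies once at the end.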

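These are good solutions as long as dim << chunk_size: each thread then spends only dim additions on the merge, against roughly chunk_size * dim additions in the loop divided by the number of threads.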
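Putting the hand-written reduction together with the chunked read, a sketch of the whole routine might look like the following. This is not the asker's exact function: the name `read_and_sum`, its parameters, and its return value are hypothetical. It assumes the intent is to read the file chunk by chunk until it is exhausted, it sizes the buffer for one chunk only, and it uses the item count returned by `fread` instead of assuming every chunk is full.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical rewrite: sums all complete rows of `dim` doubles found in fp
 * into centroids[0..dim). Returns the number of rows summed, or -1 if the
 * chunk buffer could not be allocated. The caller closes fp. */
long read_and_sum(FILE *fp, double *centroids, int dim, int chunk_size)
{
    size_t items = (size_t)chunk_size * dim;
    double *buffer = malloc(items * sizeof *buffer);
    long total_rows = 0;
    size_t got;

    if (buffer == NULL)
        return -1;

    /* File I/O stays serial; only the accumulation runs in parallel. */
    while ((got = fread(buffer, sizeof *buffer, items, fp)) > 0) {
        int rows = (int)(got / dim);  /* complete rows in this chunk */

        #pragma omp parallel
        {
            double tmp[dim];          /* per-thread partial sums */
            for (int j = 0; j < dim; j++)
                tmp[j] = 0.0;

            #pragma omp for
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < dim; j++)
                    tmp[j] += buffer[(size_t)i * dim + j];

            /* Merge: one thread at a time, dim additions each. */
            #pragma omp critical
            for (int j = 0; j < dim; j++)
                centroids[j] += tmp[j];
        }
        total_rows += rows;
    }

    free(buffer);
    return total_rows;
}
```

A call such as `read_and_sum(fp_read, centroids, 16, 10000)` would match the sizes used in the question. A trailing partial row, if the file has one, is ignored in this sketch.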

Z boson