I have been working through the introductory OpenMP examples, and on the first multithreaded one (a numerical integration to approximate pi) I knew the bit about false sharing would be coming, so I implemented the following:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <omp.h>

#define STEPS 100000000.0
#define MAX_THREADS 4

void pi(double start, double end, double **sum);

int main(){
    double * sum[MAX_THREADS];

    omp_set_num_threads(MAX_THREADS);

    double inc;
    bool set_inc=false;
    double start=omp_get_wtime();
    #pragma omp parallel
    {
        int ID=omp_get_thread_num();
        #pragma omp critical
        if(!set_inc){
            int num_threads=omp_get_num_threads();
            printf("Using %d threads.\n", num_threads);
            inc=1.0/num_threads;
            set_inc=true;
        }

        pi(ID*inc, (ID+1)*inc, &sum[ID]);
    }
    double end=omp_get_wtime();
    double tot=0.0;
    for(int i=0; i<MAX_THREADS; i++){
        tot=tot+*sum[i];
        free(sum[i]);
    }
    tot=tot/STEPS;
    printf("The value of pi is: %.8f. Took %f secs.\n", tot, end-start);
    return 0;
}

void pi(double start, double end, double **sum_ptr){
    double *sum=(double *) calloc(1, sizeof(double));
    for(double i=start; i<end; i=i+1/STEPS){
        *sum=*sum+4.0/(1.0+i*i);
    }
    *sum_ptr=sum;
}

My idea was that, by using calloc, the chance of the returned pointers being contiguous, and thus being pulled into the same cache line, was virtually nil (though I was a tad unsure why there would be false sharing in the first place, since a double is 64 bits here and I thought my cache lines were 8 bytes as well, so if you can enlighten me there too...). I now realize cache lines are typically 64 bytes, not 64 bits, so several doubles can sit on one line.

For fun, after compiling I ran the program in quick succession, and here's a short example of what I got (I was definitely pressing up-arrow and Enter in the terminal faster than one press per 0.5 seconds):

user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.104703 secs.
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.196900 secs.

I thought that maybe something was happening because of the way I tried to avoid the false sharing, and since I am still ignorant of the complete happenings among the levels of memory, I chalked it up to that. So I followed the tutorial's prescribed method, using a "critical" section like so:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <omp.h>

#define STEPS 100000000.0
#define MAX_THREADS 4

double pi(double start, double end);

int main(){
    double sum=0.0;

    omp_set_num_threads(MAX_THREADS);

    double inc;
    bool set_inc=false;
    double start=omp_get_wtime();
    #pragma omp parallel
    {
        int ID=omp_get_thread_num();
        #pragma omp critical
        if(!set_inc){
            int num_threads=omp_get_num_threads();
            printf("Using %d threads.\n", num_threads);
            inc=1.0/num_threads;
            set_inc=true;
        }

        double temp=pi(ID*inc, (ID+1)*inc);
        #pragma omp critical
        sum+=temp;
    }
    double end=omp_get_wtime();

    sum=sum/STEPS;
    printf("The value of pi is: %.8f. Took %f secs.\n", sum, end-start);
    return 0;
}

double pi(double start, double end){
    double sum=0.0;
    for(double i=start; i<end; i=i+1/STEPS){
        sum=sum+4.0/(1.0+i*i);
    }
    return sum;
}

The doubling in run time is virtually identical. What's the explanation for this? Does it have anything to do with low-level memory? And can you answer my intermediate question about false sharing and cache lines?

Thanks a lot.

Edit:

The compiler is gcc 7 on Kubuntu 17.10. The options used were -fopenmp -W -o (in that order).

The system specs include an i5-6500 @ 3.2 GHz and 16 GB of DDR4 RAM (though I forget its clock speed).

As some have asked, the program's run time does not continue to double if it is run more than twice in quick succession. After the initial doubling, it remains at around the same time (~0.2 secs) for as many successive runs as I have tested (5+). After waiting a second or two, the run time returns to the lower value. However, when the runs are launched not manually in succession but in a single command line such as ./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe; I get:

The value of pi is: 3.14159273. Took 0.100528 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.097707 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.098078 secs.
...

Adding gcc optimization options (-O3) did not change any of the results.

CircArgs
  • Aside: Consider that `for(double i=start; i – chux - Reinstate Monica Feb 04 '18 at 03:36
  • If you mean the loop misses i=1 I just decided to ignore that... Certainly is close enough and the Ancient Egyptians would be very happy with this estimate – CircArgs Feb 04 '18 at 03:40
  • For your actual question, please include a system description. Particularly CPU, memory, frequency configuration. what happens if you run the program three times or more - just use `./pi_mp;./pi_mp;./pi_mp`... – Zulan Feb 04 '18 at 09:43
  • What's your compiler and more importantly, your compilation options? – Gilles Feb 05 '18 at 07:14
  • @Zulan updated question – CircArgs Feb 05 '18 at 21:18
  • @Gilles updated question – CircArgs Feb 05 '18 at 21:18
  • You're asking a question about performance without compiling with optimization. Why do you care about the performance without optimization enabled? I doubt anyone else cares about that. Compile with at least `-O2` or `-O3` and check the numbers. – Z boson Feb 06 '18 at 07:56
  • @Zboson (1) no change (see update). (2) I said that when `./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;` is run there is no difference in performance and optimizations weren't done as per the options I showed. – CircArgs Feb 09 '18 at 00:41
  • I can sorta reproduce your results. If I run quickly in succession I can get it to take more than twice as long. I can mitigate this problem by increasing the size of `STEPS` or by putting `usleep(10000);` before the parallel region is timed. I used `-O3 -march=native -fopenmp` because there was a timing bug in `omp_get_wtime` on Skylake systems in the past, but it's probably been fixed now with Ubuntu 17.10 https://stackoverflow.com/a/43277581/2542702 – Z boson Feb 12 '18 at 09:08
