I'm trying to learn OpenMP to parallelize a part of my code and I'm trying to figure out why it's not faster when using 2 threads instead of 1. Here's a minimal working example of the code:

#include <iostream>
#include <omp.h>

using namespace std;

class My_class
{
    public :

        // Constructor
        My_class(int nuIterations) 
            : prVar_(0),
              nuIters_(nuIterations)
        {} // Empty

        // Do something expensive involving the class' private vars
        void do_calculations()
        {
            for (int i=0;i<nuIters_;++i){
                prVar_=prVar_+i+2*i+3*i+4*i-5*i-4*i;
            }
        }

        // Retrieve result
        double getResult()
        {
            return prVar_;
        }

    private:

        double prVar_;
        int nuIters_;

};

int main()
{
    // Initialize one object for every thread
    My_class *test_object1, *test_object2;
    test_object1 = new My_class(1000000000);
    test_object2 = new My_class(500000000);

    // Set number of threads (use one line at a time)
    omp_set_num_threads(1); // One thread executes in 11.5 real seconds
    //omp_set_num_threads(2); // Two threads execute in 13.2 real seconds
    double start = omp_get_wtime(); // Start timer
#pragma omp parallel sections // Do calculations in parallel
    {
#pragma omp section
        {
            test_object1->do_calculations();
        }
#pragma omp section
        {
            test_object2->do_calculations();
        }
    }// End of parallel sections
    // Print results
    double end = omp_get_wtime();
    cout<<"Res 1 : "<<test_object1->getResult()<<endl;
    cout<<"Res 2 : "<<test_object2->getResult()<<endl;
    cout<<"Time  : "<<end-start<<endl;

    return 0;
}

Compiling and running this with `g++ myomp.cpp -O0 -std=c++11 -fopenmp` gives the following execution times for 1 and 2 threads:

  1. 1 thread : 11.5 seconds
  2. 2 threads: 13.2 seconds

Is there some way I can speed this up for 2 threads? I am running this on a 4-core Intel i7-4600U and Ubuntu.

EDIT: Changed most of the post such that it follows the guidelines.

Nikos Kazazakis
    You have to give us more information in form of an [mcve] plus your hardware specifications, otherwise an answer is just guessing. Guesses include: Writing to shared cache lines, being memory bound, implicit synchronization, usage of shared resources that you are not aware of or combinations thereof. – Zulan May 18 '16 at 07:23
  • Thanks for your comment, I'll try to work out an appropriate example and edit the post! – Nikos Kazazakis May 18 '16 at 20:24
  • Done, hope it makes sense now! – Nikos Kazazakis May 18 '16 at 21:06
  • If it's any consolation, it takes 8.9s on my iMac with 1 thread and 5.6s with 2 threads. I used `-O3` to compile. – Mark Setchell May 18 '16 at 21:17
  • I turned off optimizations on purpose to avoid affecting the result, my actual code is compiled using O3 but it's difficult to reproduce.. :( – Nikos Kazazakis May 18 '16 at 21:19
  • Pro Tip: Don't benchmark with optimizations disabled unless you actually run your code in production with optimizations disabled. Unless you're trying to measure the impact of compiler optimizations itself. – Mysticial May 18 '16 at 21:30

1 Answer

There are two effects here:

  1. Cache line contention (false sharing): You have two very small objects allocated in dynamic memory. If they end up on the same cache line (usually 64 bytes), the two threads updating `prVar_` will each need exclusive (write) access to that line, so it ping-pongs between their level 1 caches. Because the effect depends on where the allocator happens to place the objects, the runtime should vary noticeably from run to run. To check, print the pointer addresses and divide them by 64: identical quotients mean the objects share a line. To fix the issue, pad or align the memory so each object occupies its own cache line.

  2. You have a huge load imbalance. One section performs twice as many iterations as the other (10^9 vs. 5*10^8), so the runtime is bounded by the larger task and, even under idealized conditions, the best possible speedup is 1.5x.

Zulan
  • I added a 1,000 element array of doubles to the private variables (to try and separate the two objects further than one cache line) and it made no difference at all... – Mark Setchell May 19 '16 at 09:25
  • What does "no difference" mean? If you already get the 1.5x speedup, then you are likely not affected by 1) in the first place. – Zulan May 19 '16 at 11:09
  • Agreed. I was trying to say that there doesn't seem to be any cache line contention, but I phrased it badly. I also tried making the two workloads identically sized and the speedup was indeed 2-fold. So, on my machine at least, it appears to be almost exclusively your second effect. Have my vote :-) – Mark Setchell May 19 '16 at 11:36