32

In the following example, the C++11 threads take about 50 seconds to execute, but the OpenMP threads take only 5. Any ideas why? (I can assure you the difference still holds if you do real work instead of doNothing, or if you run the cases in a different order, etc.) I'm on a 16-core machine, too.

#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>

using namespace std;

void doNothing() {}

double run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();

    for(int j=1; j<100000; ++j)
    {
        if(algorithmToRun == 1)
        {
            vector<thread> threads;
            for(int i=0; i<16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for(auto& thread : threads) thread.join();
        }
        else if(algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for(unsigned i=0; i<16; i++)
            {
                doNothing();
            }
        }
    }

    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;

    return elapsed_seconds.count();
}

int main()
{
    double cppt = run(1);
    double ompt = run(2);

    cout<<cppt<<endl;
    cout<<ompt<<endl;

    return 0;
}
joce
user2588666
    My guess is that OpenMP is smart enough to optimize out the whole loop since it's a NOP. With ``threads`` you're suffering the overhead of spinning up and tearing down all those NOP threads. Try adding some actual code to the test function and see what happens. – aruisdante Apr 24 '14 at 01:16
  • Well, one thing is that you're using a dynamically resizing container to hold the threads; that can't help with performance. – RamblingMad Apr 24 '14 at 01:17
  • Try just using a fixed sized array and initiating all its elements when created. – RamblingMad Apr 24 '14 at 01:17
  • @aruisdante: I have added real code, and I can assure you the difference persists (I had lots of code and factored it down to post on here)--it's not due to the NOP. – user2588666 Apr 24 '14 at 01:18
  • @CoffeeandCode: I've done that (and just tried again), and the difference is negligible, as the call to thread() calls new anyway. Good point though--But I also can assure you that that does not affect the performance. – user2588666 Apr 24 '14 at 01:22
  • likely comes down to: https://stackoverflow.com/questions/3949901/pthreads-vs-openmp on Linux – Ciro Santilli OurBigBook.com Sep 06 '17 at 13:29

2 Answers

37

OpenMP uses thread pools for its pragmas (see also here and here). Spinning up and tearing down threads is expensive; OpenMP avoids that overhead, so all it is doing is the actual work plus the minimal shared-memory shuttling of execution state. In your std::thread code you are spinning up and tearing down a new set of 16 threads on every iteration.

aruisdante
  • Thanks. This _has_ to be the answer, but wouldn't you almost think that it would require a #pragma around the outer for loop? Also, how do you know that as a fact--I don't see information about it in the documentation, even in the linked site. I'm sure that you're right, I just want to back up the information. I haven't ever read, as a fact, that they do that. – user2588666 Apr 24 '14 at 01:26
  • Check out the second link, there's some talk in there. I can try and find more solid documentation, I know I've read it explicitly somewhere. – aruisdante Apr 24 '14 at 01:27
  • [here](http://openmp.org/forum/viewtopic.php?f=3&t=136) is another discussion about it. Basically, it's actually not defined by the OpenMP standard, but most implementations on most platforms seem to do it if it's more efficient. – aruisdante Apr 24 '14 at 01:29
  • Thanks again :). I figured it had to be threadpools, but I just surprisingly couldn't find it stated anywhere. After looking some more, I found [this](https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_par_multicore_thrdpool.htm). I'm going to try it on a non-intel machine and see if it still holds true. You beat me to it--It does look like it's basically done in all implementations. – user2588666 Apr 24 '14 at 01:31
  • PS. I can confirm the difference also exists on AMD machines. – user2588666 Apr 24 '14 at 01:40
2

I tried the code from Choosing the right threading framework with 100 loop iterations, and it took OpenMP 0.0727, Intel TBB 0.6759, and the C++ thread library 0.5962 milliseconds.

I also applied what aruisdante suggested:

void nested_loop(int max_i, int band)  
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}

This code seems to take less time than your original C++11 thread section.

Cloud Cho