Why is OpenMP outperforming threads?

Question

I've been calling this in OpenMP

#pragma omp parallel for num_threads(totalThreads)
for(unsigned i=0; i<totalThreads; i++)
{
    workOnTheseEdges(startIndex[i], endIndex[i]);
}

And this in C++11 std::threads (I believe those are just pthreads)

vector<thread> threads;
for(unsigned i=0; i<totalThreads; i++)
{
    threads.push_back(thread(workOnTheseEdges, startIndex[i], endIndex[i]));
}
for (auto& thread : threads)
{
    thread.join();
}

But the OpenMP implementation is about twice as fast! I would have expected the C++11 threads to be faster, since they are more low-level. Note: the code above is called not just once but roughly 10,000 times in a loop, so maybe that has something to do with it?

Edit: for clarification, in practice I use either the OpenMP version or the C++11 version, not both. With the OpenMP code the run takes 45 seconds; with the C++11 code it takes 100 seconds.

My crystal ball fails to reveal what the value of `totalThreads` is, how many cores/HW threads your CPU has, what the size of `startIndex` is and how much time it takes to execute `workOnTheseEdges()` once. — Hristo Iliev, Apr 23 '14 at 21:50
workOnTheseEdges simply goes through a for loop, essentially adding 10,000 times. startIndex is an unsigned integer. totalThreads is 16 and so is the size of startIndex, and the CPU has 16 cores. But, I just am wondering why OpenMP is faster, as they are doing the same thing. — user2588666, Apr 23 '14 at 21:53
@KerrekSB: Are you saying that when that omp code block is called in a for loop, the threads stick around instead of being created and destroyed, whereas the C++11 code creates and destroys its threads in each iteration? — user2588666, Apr 23 '14 at 21:56
They aren't doing the same thing. The OpenMP version is distributing 10,000 tasks over 16 threads. The C++11 version is running 10,000 tasks on 10,000 threads. Threads are expensive, and having more threads than cores is even more expensive. You can't just throw new threads at every small task (unless you happen to have 10,000 or so cores to run them on). The OpenMP version is taking care of this for you. — adpalumbo, Apr 23 '14 at 21:56
@user2588666: No, I'm saying that OpenMP is smarter about distributing work in a realistic fashion. — Kerrek SB, Apr 23 '14 at 21:56
@adpalumbo: I could see where you're gathering that from, but you're wrong. I should alter the code, but startIndex.size() is equal to totalThreads. — user2588666, Apr 23 '14 at 21:59
@kerrekSB: Why would the OpenMP implementation be distributed differently from my c++11 threads? The same number of threads are created in both instances. — user2588666, Apr 23 '14 at 22:15
@user2588666: You said "the above code is called in a loop". Each and every time it's called, the `std::thread` version creates `totalThreads` _new_ threads, but the `OpenMP` is reusing the same 16 each time the loop executes. — Mooing Duck, Apr 23 '14 at 22:15
@MooingDuck: That makes sense; that was my hunch, but I didn't expect OpenMP to be that advanced. Appreciated :) — user2588666, Apr 23 '14 at 22:17
@user2588666: Visual Studio implements `std::async` so that it reuses the same threads. Other than that, you'd have to manage the 16 threads yourself, which is pretty easy in your case: make the `threads` vector static (http://coliru.stacked-crooked.com/a/3fdad471c0c26d41). — Mooing Duck, Apr 23 '14 at 22:20
@MooingDuck: Thanks for the hints. That's really interesting about std::async in Visual Studio. Making the threads vector static didn't make a noticeable difference (<1% speedup, if any), and I'm compiling on a Linux machine so async didn't give an improvement either. But, again, thanks for your comments! — user2588666, Apr 23 '14 at 22:44
What is the value of totalThreads? How are you doing the timing? Also, once you have used OpenMP, the thread overhead for the next iteration is much lower. The C++ threads version may be killing the threads every iteration. Try restarting the threads each iteration rather than creating them again. — Z boson, Apr 25 '14 at 07:39

score 5 · Answer 1 · answered Apr 23 '14 at 21:51

Where does totalThreads come from in your OpenMP version? I bet it's not startIndex.size().

The OpenMP version queues the requests onto totalThreads worker threads. It looks like the C++11 version creates startIndex.size() threads, which involves a ridiculous amount of overhead if that's a big number.

adpalumbo

Sorry for the confusion, but startIndex.size() is equal to totalThreads. – user2588666 Apr 23 '14 at 21:58

score 3 · Accepted Answer · answered Apr 25 '14 at 05:46

Consider the following code. The OpenMP version runs in about 0 seconds, while the C++11 version runs in 50 seconds. The difference is not caused by the function being a no-op (doNothing), nor by the vector being constructed inside the loop. The C++11 threads are created and then destroyed in every iteration of the outer loop. OpenMP, on the other hand, effectively implements a thread pool: the worker threads are kept alive and reused from one parallel region to the next. This is not required by the standard, but it is what Intel's and AMD's implementations do.

for(int j=1; j<100000; ++j)
{
    if(algorithmToRun == 1)
    {
        vector<thread> threads;
        for(int i=0; i<16; i++)
        {
            threads.push_back(thread(doNothing));
        }
        for(auto& thread : threads) thread.join();
    }
    else if(algorithmToRun == 2)
    {
        #pragma omp parallel for num_threads(16)
        for(unsigned i=0; i<16; i++)
        {
            doNothing();
        }
    }
}
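To put a rough number on the creation/destruction cost in the `algorithmToRun == 1` branch above, one could time it in isolation. The following is only a sketch: the helper name `averageSpawnJoinMicros` is hypothetical, `doNothing` is an empty stand-in for the function used above, and the result will vary with OS, hardware, and load.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Empty stand-in for the doNothing() used in the loop above (assumption).
void doNothing() {}

// Hypothetical helper: average wall-clock time, in microseconds, of one
// "create numThreads threads, then join them all" round.
double averageSpawnJoinMicros(unsigned numThreads, unsigned rounds)
{
    assert(rounds > 0);
    auto begin = std::chrono::steady_clock::now();
    for (unsigned r = 0; r < rounds; ++r)
    {
        std::vector<std::thread> threads;
        threads.reserve(numThreads);
        for (unsigned i = 0; i < numThreads; ++i)
            threads.emplace_back(doNothing);   // a new OS thread every time
        for (auto& t : threads)
            t.join();                          // the thread is gone after this
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> elapsed = end - begin;
    return elapsed.count() / rounds;
}
```

Calling `averageSpawnJoinMicros(16, 1000)` and multiplying the result by the number of outer iterations gives an estimate of how much of the total run time is spent purely on creating and joining threads, independent of the real work.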

Instead of creating/destroying the threads each iteration of `j` there is probably a way to define them outside of the main loop and restart them for each `j`. — Z boson, Apr 25 '14 at 07:41
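One possible way to do what this comment suggests with plain C++11 is sketched below: the threads are created once, sleep on a condition variable between rounds, and are woken for each `j`. The class `PersistentWorkers` and its method `runOnAll` are hypothetical names made up for this illustration. This is not how any particular OpenMP runtime is implemented, and it does not handle exceptions thrown by the task.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: a fixed set of threads reused across many rounds.
class PersistentWorkers
{
public:
    explicit PersistentWorkers(unsigned count)
    {
        assert(count > 0);
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this, i] { workerLoop(i); });
    }

    ~PersistentWorkers()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_)
            w.join();
    }

    // Runs task(i) once on every worker i and blocks until all have finished.
    void runOnAll(const std::function<void(unsigned)>& task)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        task_ = &task;
        remaining_ = workers_.size();
        ++generation_;                 // signals "a new round has started"
        wake_.notify_all();
        done_.wait(lock, [this] { return remaining_ == 0; });
    }

private:
    void workerLoop(unsigned index)
    {
        unsigned long seen = 0;        // last round this worker took part in
        for (;;)
        {
            const std::function<void(unsigned)>* task = nullptr;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [&] { return stopping_ || generation_ != seen; });
                if (stopping_)
                    return;
                seen = generation_;
                task = task_;
            }
            (*task)(index);            // do the work outside the lock
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--remaining_ == 0)
                    done_.notify_one(); // last worker wakes the caller
            }
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable done_;
    std::vector<std::thread> workers_;
    const std::function<void(unsigned)>* task_ = nullptr;
    std::size_t remaining_ = 0;
    unsigned long generation_ = 0;
    bool stopping_ = false;
};
```

With this in place, the outer loop would construct `PersistentWorkers pool(16);` once before the loop and call something like `pool.runOnAll([&](unsigned i) { workOnTheseEdges(startIndex[i], endIndex[i]); });` in each iteration, so thread creation is paid once instead of once per iteration. The remaining per-round cost is the wake-up and the wait for completion, which may still be slower than an OpenMP runtime that spin-waits between parallel regions.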
