
I am trying to learn more about OpenMP and cache contention, so I wrote a simple program to better understand how it works. I am getting bad thread scaling for a simple addition of vectors, but I don't understand why. This is my program:

#include <iostream>
#include <omp.h>
#include <vector>

using namespace std;

int main(){

    // Initialize stuff
    int nuElements=20000000; // Number of elements
    int i;
    vector<int> x, y, z;
    x.assign(nuElements,0);
    y.assign(nuElements,0);
    z.assign(nuElements,0);
    double start; // Timer

    for (i=0;i<nuElements;++i){
       x[i]=i;
       y[i]=i;
    }    

    // Increase the threads by 1 every time, and add the two vectors  
    for (int t=1;t<5;++t){

        // Re-set z vector values (clear() would shrink the size to 0 and make z[i] below out of bounds)
        z.assign(nuElements,0);

        // Set number of threads for this iteration
        omp_set_num_threads(t);

        // Start timer
        start=omp_get_wtime();

        // Parallel for
#pragma omp parallel for
        for (i=0;i<nuElements;++i)
        {
            z[i]=x[i]+y[i];
        }
        // Print wall time
        cout<<"Time for "<<omp_get_max_threads()<<" thread(s) : "<<omp_get_wtime()-start<<endl;
    }
    return 0;
}

Running this produces the following output:

Time for 1 thread(s) : 0.020606
Time for 2 thread(s) : 0.022671
Time for 3 thread(s) : 0.026737
Time for 4 thread(s) : 0.02825

I compiled with this command: clang++ -O3 -std=c++11 -fopenmp=libiomp5 test_omp.cpp

As you can see, the scaling just gets worse as the number of threads increases. I am running this on a 4-core Intel i7 processor. Does anyone know what's happening?

Nikos Kazazakis
  • Look at [this post](http://stackoverflow.com/a/11579987/5239503). This explains why you won't see much performance improvement on your memory-bound problem with one single socket / NUMA node. Here you already achieve roughly 12 GB/s of throughput with one core (code probably well vectorised). Check the specs of your hardware but I guess you can't do much better. – Gilles May 25 '16 at 04:29
  • Can you please provide information about your exact memory configuration and CPU model. – Zulan May 25 '16 at 18:12

1 Answer


You are limited by memory bandwidth, not CPU speed. It only takes one CPU to keep your memory busy if all you're doing is addition and copying, so adding more cores doesn't help.
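
A rough back-of-the-envelope check (using the sizes from the question): each pass streams three int vectors of 20,000,000 elements, i.e. about 3 × 20,000,000 × 4 bytes ≈ 240 MB, and it finishes in roughly 0.02 s, which works out to around 12 GB/s. That is already in the ballpark of what a typical desktop memory subsystem can sustain, so there is little headroom left for extra threads to exploit.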

If you want to see the benefit of adding more threads, try executing more complex operations on memory that is small enough to fit in the L1 or L2 cache.
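
For instance, a minimal sketch along those lines (purely illustrative; the array size, repeat count, and trig-heavy kernel are arbitrary choices, not tuned to your hardware):

#include <iostream>
#include <cmath>
#include <omp.h>
#include <vector>

int main(){

    // Small working set: two vectors of 20,000 doubles (~320 KB total),
    // small enough to stay cache-resident instead of streaming from RAM
    const int n=20000;
    const int repeats=2000; // repeat the kernel so each timing is measurable
    std::vector<double> a(n,1.0), b(n,0.0);

    for (int t=1;t<5;++t){

        omp_set_num_threads(t);
        double start=omp_get_wtime();

        for (int r=0;r<repeats;++r){
#pragma omp parallel for
            for (int i=0;i<n;++i){
                // Much more arithmetic per element than a single add
                b[i]=std::sin(a[i])*std::cos(a[i])+std::sqrt(a[i]+i);
            }
        }
        std::cout<<"Time for "<<t<<" thread(s) : "<<omp_get_wtime()-start<<std::endl;
    }
    // Keep the result observable so the optimizer cannot discard the work
    std::cout<<"checksum : "<<b[0]+b[n/2]<<std::endl;
    return 0;
}

Because the data stays in cache and each element costs several floating-point operations, the loop is bound by arithmetic rather than memory bandwidth, so the wall time should now drop as threads are added.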

Matt Timmermans
  • While you are probably basically right, this is a bad generalization: "*It only takes one CPU to keep your memory busy if all you're doing is addition and copying, so adding more cores doesn't help.*" It is simply not true for current systems. There is also a huge difference between desktop systems and high-performance systems; the latter benefit much more from multiple threads even when memory-bound. Please see [this answer](http://stackoverflow.com/a/36824169/620382) or [the discussion on this answer](http://stackoverflow.com/a/37099530/620382) for details. – Zulan May 25 '16 at 19:22
  • @Zulan can you share why that's true? What exactly are those threads going to do if there's no way for them to communicate with the GPU? – xaxxon May 26 '16 at 00:59
  • @Zulan, when I tell the OP that it only takes one CPU to keep his memory busy, I'm referring to *his* memory and *his* CPU, not generalizing to all current systems. – Matt Timmermans May 26 '16 at 02:43
  • @xaxxon, I don't understand the question. What do GPUs have to do with this? – Zulan May 26 '16 at 07:54
  • @MattTimmermans, 1) Without specific information about his memory, it is a (good and probably correct) guess. 2) While you are technically referring to *his* system, it's very easy to read it as a generalization. I think your answer could be improved by being clearer about that. – Zulan May 26 '16 at 08:28
  • @Zulan oops, I didn't know what openmp was. I thought it was a GPU programming thing. – xaxxon May 26 '16 at 19:35