I am trying to learn more about OpenMP and cache contention, so I wrote a simple program to better understand how they work. The program adds two vectors element by element, but it scales badly as I add threads, and I don't understand why. This is my program:
#include <iostream>
#include <omp.h>
#include <vector>
using namespace std;
int main(){
    // Initialize stuff
    int nuElements=20000000; // Number of elements
    int i;
    vector<int> x, y, z;
    x.assign(nuElements,0);
    y.assign(nuElements,0);
    z.assign(nuElements,0);
    double start; // Timer
    for (i=0;i<nuElements;++i){
        x[i]=i;
        y[i]=i;
    }
    // Increase the threads by 1 every time, and add the two vectors
    for (int t=1;t<5;++t){
        // Reset z to nuElements zeros. (clear() would shrink the size to 0,
        // which makes the z[i] writes below out of bounds.)
        z.assign(nuElements,0);
        // Set number of threads for this iteration
        omp_set_num_threads(t);
        // Start timer
        start=omp_get_wtime();
        // Parallel for (the loop variable i is private to each thread)
        #pragma omp parallel for
        for (i=0;i<nuElements;++i)
        {
            z[i]=x[i]+y[i];
        }
        // Print wall time
        cout<<"Time for "<<omp_get_max_threads()<<" thread(s) : "<<omp_get_wtime()-start<<endl;
    }
    return 0;
}
Running this produces the following output:
Time for 1 thread(s) : 0.020606
Time for 2 thread(s) : 0.022671
Time for 3 thread(s) : 0.026737
Time for 4 thread(s) : 0.02825
I compiled with this command: clang++ -O3 -std=c++11 -fopenmp=libiomp5 test_omp.cpp
As you can see, the run time gets worse, not better, as the number of threads increases. I am running this on a 4-core Intel Core i7 processor. Does anyone know what's happening?
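In case it helps with the diagnosis, here is a rough sketch of how the timings above could be converted into an effective memory throughput, to check whether the loop might be limited by memory traffic rather than by arithmetic. The helper names bytesMoved and effectiveGBps are made up for this illustration, and the estimate assumes 4-byte ints and counts only two loads and one store per element, so it is a lower bound on the real traffic.

```cpp
#include <cstddef>

// Hypothetical helper: bytes touched by one pass of z[i] = x[i] + y[i].
// Counts two loads (x, y) and one store (z) per element. Any extra
// write-allocate traffic for z is ignored, so this is a lower bound.
constexpr double bytesMoved(std::size_t numElements, std::size_t elemSize) {
    return 3.0 * static_cast<double>(numElements) * static_cast<double>(elemSize);
}

// Hypothetical helper: effective throughput in GB/s (1 GB = 1e9 bytes)
// for a pass that took `seconds` of wall time.
constexpr double effectiveGBps(std::size_t numElements, std::size_t elemSize,
                               double seconds) {
    return bytesMoved(numElements, elemSize) / seconds / 1e9;
}
```

With the numbers from my run, effectiveGBps(20000000, sizeof(int), 0.020606) comes out at roughly 11.6 GB/s for 1 thread, and effectiveGBps(20000000, sizeof(int), 0.02825) at roughly 8.5 GB/s for 4 threads, assuming sizeof(int) is 4 on this machine.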