I made an odd observation: when I activate the CXX flag -march=native,
the timing of my code snippets becomes twice as long.
// -> sequential
int n = (int) 1e7;
Vector<double, 32> a;
a.init(n);
for (int i = 0; i < n; i++)
a(i) = 1.0;
double r1;
Timer::start();
psum(a.data, n, r1);
Timer::stop();
std::cout << "timing (ms): " << Timer::get_timing() << std::endl;
std::cout << r1 << std::endl;
// <-
// -> threading simple
int n_threads = 2;
Vector<double, 32> b;
b.init(n);
for (int i = 0; i < n; i++)
b(i) = 2.0;
double r2;
Timer::start();
std::thread t1(psum, b.data + n/2, n/2, std::ref(r1));
psum(b.data, n/2, r2);
t1.join();
Timer::stop();
std::cout << "timing (ms): " << Timer::get_timing() << std::endl;
std::cout << r1 + r2 << std::endl;
// <-
Specifically, the threaded example jumps from 8 ms to 16 ms, and 16 ms is also the timing of the sequential code, so the benefit of the second thread disappears.
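Since psum, Vector and Timer are not shown above, here is a minimal self-contained sketch of the same measurement that can be compiled on its own. It is only an approximation of my setup: it assumes psum is a plain accumulation loop, uses a raw double pointer (for example from std::vector, without the 32-byte alignment), and introduces two hypothetical helpers, sum_two_threads and time_ms, that are not part of the original code.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Assumed stand-in for psum: plain sequential accumulation into result.
void psum(const double* data, int n, double& result) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += data[i];
    result = s;
}

// Hypothetical helper: sums the first half on the calling thread and the
// second half on a worker thread, mirroring the "threading simple" snippet.
double sum_two_threads(const double* data, int n) {
    double lo = 0.0, hi = 0.0;
    int half = n / 2;
    std::thread t(psum, data + half, n - half, std::ref(hi));
    psum(data, half, lo);
    t.join();
    return lo + hi;
}

// Hypothetical helper: wall-clock time of a callable in milliseconds,
// measured with a monotonic clock (stands in for Timer).
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

To compare the two variants, fill a std::vector<double> v with n elements, then call time_ms([&]{ psum(v.data(), n, r); }) and time_ms([&]{ r = sum_two_threads(v.data(), n); }), once with and once without -march=native, printing r each time so the sums are not optimized away.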
Extra info:
- g++ 6.2 compiler
- Ubuntu 16.10
- Intel i5-6200U (Skylake)
- vectors are 32-byte aligned
- compile with
c++ -std=c++11 -O3 -pthread ...
Any idea where this comes from?
UPDATE 1
When I activate only -mtune=skylake,
the timing jumps to 32 ms.