I have two unrelated for loops: one is executed serially, and one is executed with an OpenMP parallel for construct.
The serial code becomes slower the more OpenMP threads I use.
#include &lt;chrono&gt;
#include &lt;iostream&gt;
#include &lt;random&gt;
#include &lt;vector&gt;
#include &lt;omp.h&gt;

class Foo {
public:
    Foo(size_t size) {
        parallel_vector.resize(size, 0.0);
        serial_vector.resize(size, 0.0);
    }

    void do_serial_work() {
        std::mt19937 random_number_generator;
        std::uniform_real_distribution<double> random_number_distribution{ 0.0, 1.0 };
        for (size_t i = 0; i < serial_vector.size(); i++) {
            serial_vector[i] = random_number_distribution(random_number_generator);
        }
    }

    void do_parallel_work() {
#pragma omp parallel for
        for (auto i = 0; i < parallel_vector.size(); ++i) {
            for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
                parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            }
        }
    }

private:
    std::vector<double> parallel_vector;
    std::vector<double> serial_vector;
};
void test_with_size(size_t size, int num_threads) {
    std::cout << "Testing with " << num_threads << " and size: " << size << "\n";
    omp_set_num_threads(num_threads);
    Foo foo{ size };

    long long total_dur_1 = 0;
    long long total_dur_2 = 0;

    for (auto i = 0; i < 500; i++) {
        const auto tp_1 = std::chrono::high_resolution_clock::now();
        foo.do_serial_work();
        const auto tp_2 = std::chrono::high_resolution_clock::now();
        foo.do_parallel_work();
        const auto tp_3 = std::chrono::high_resolution_clock::now();

        const auto dur_1 = std::chrono::duration_cast<std::chrono::microseconds>(tp_2 - tp_1).count();
        const auto dur_2 = std::chrono::duration_cast<std::chrono::microseconds>(tp_3 - tp_2).count();
        total_dur_1 += dur_1;
        total_dur_2 += dur_2;
    }
    std::cout << total_dur_1 << "\t" << total_dur_2 << "\n";
}
int main(int argc, char** argv) {
    test_with_size(100000, 1);
    test_with_size(100000, 2);
    test_with_size(100000, 4);
    test_with_size(100000, 8);
    return 0;
}
The slowdown happens on my local machine, a Win10 laptop with an Intel Core i7-7700 (4 cores with hyperthreading) and 24 GB of RAM. The compiler is the latest in Visual Studio 2019. Compiled as RelWithDebInfo (from CMake; includes /O2 and /openmp).
It does not happen when I use a stronger machine, a CentOS 8 box with 2x Intel Xeon Platinum 9242 (48 cores each, no hyperthreading) and 769 GB of RAM. The compiler is gcc/8.3.1, compiled with g++ --std=c++17 -O3 -fopenmp.
Timings on Win10 i7-7700:
Testing with 1 and size: 100000
3043846 10536315
Testing with 2 and size: 100000
3276611 5350204
Testing with 4 and size: 100000
3937311 2735655
Testing with 8 and size: 100000
5002727 1598775
and on CentOS 8, 2x Xeon Platinum 9242:
Testing with 1 and size: 100000
727756 4111363
Testing with 2 and size: 100000
731649 2069257
Testing with 4 and size: 100000
734019 1056157
Testing with 8 and size: 100000
752584 544373
So my initial thought was "there's too much pressure on the cache". However, when I removed virtually everything from the parallel section except the loop itself, the slowdown still occurred.
Updated parallel section with the work taken out:
void do_parallel_work() {
#pragma omp parallel for
    for (auto i = 0; i < 8; ++i) {
        //for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
        //    parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
        //}
    }
}
Timings on Win10 with updated parallel section:
Testing with 1 and size: 100000
3206293 636
Testing with 2 and size: 100000
3218667 2672
Testing with 4 and size: 100000
3928818 8689
Testing with 8 and size: 100000
5106605 10797
Looking into the OpenMP 2.0 standard (VS only supports 2.0; find it here: https://www.openmp.org/specifications/), it says in 2.7.2.5, lines 7-8:
In the absence of an explicit default clause, the default behavior is the same as if the default(shared) were specified.
And in 2.7.2.4 line 30:
All threads within the team access the same storage area for shared variables.
For me, this rules out that the OpenMP threads each copy serial_vector, which was the last explanation I could think of.
I'm happy for any explanation or discussion on the matter, even if I just plainly missed something.
EDIT:
For curiosity's sake, I also tested on my Win10 machine under WSL, which runs gcc/9.3.0. The timings are:
Testing with 1 and size: 100000
833678 2752
Testing with 2 and size: 100000
762877 1863
Testing with 4 and size: 100000
816440 1860
Testing with 8 and size: 100000
991184 2350
I'm honestly not sure why the Windows executable takes so much longer on the same machine than the Linux one (/O2 is the maximum optimization level for VC++), but funnily enough, the same artifacts don't appear here.