I've tried using OpenMP with a single #pragma omp parallel for, and it resulted in my programme going from a runtime of 35s (99.6% CPU) to 14s (500% CPU), running on Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz. That's the difference between compiling with g++ -O3 and g++ -O3 -fopenmp, both with gcc (Debian 4.7.2-5) 4.7.2 on Debian 7 (wheezy).
Why is it only using 500% CPU at most, when the theoretical maximum would be 800%, since the CPU is 4 core / 8 threads? Shouldn't it be reaching at least low 700s?
Why am I only getting a 2.5x improvement in overall time, yet at a cost of 5x in CPU use? Cache thrashing?
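For anyone wanting to reproduce this kind of figure from inside the program, here is a minimal sketch (C++11, so it needs -std=c++11 on this compiler). Timing and measure are made-up names for illustration and are not part of my programme. It assumes std::clock() reports CPU time summed over all threads of the process, which is how it behaves on Linux with glibc.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper type: wall-clock time vs. CPU time for one piece of work.
struct Timing {
    double wall_seconds;
    double cpu_seconds;

    // CPU utilisation as a percentage of one core (100% = one core fully busy).
    double cpu_percent() const
    {
        return wall_seconds > 0.0 ? 100.0 * cpu_seconds / wall_seconds : 0.0;
    }
};

// Runs work() once and records elapsed wall time and consumed CPU time.
// Assumption: std::clock() counts CPU time of every thread in the process.
template <typename Work>
Timing measure(Work work)
{
    const std::chrono::steady_clock::time_point wall_start =
        std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    work();

    const std::clock_t cpu_end = std::clock();
    const std::chrono::steady_clock::time_point wall_end =
        std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Read this way, 14s of wall time at 500% is about 70 CPU-seconds, against about 35 CPU-seconds for the serial run, so the total CPU work roughly doubled while the wall time dropped by 2.5x.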
The whole programme is based on C++ string manipulation, with recursive processing (using a lot of .substr(1) and some concatenation), where said strings are continuously inserted into a vector of set.
In other words, there are about 2k loop iterations done in a single parallel for loop, operating on the vector, and each one of them may make two recursive calls to itself with some string .substr(1) and + char concatenation. The recursion then terminates with a set .insert of either a single string or a concatenation of two strings, and said set .insert also takes care of the significant number of duplicates that are possible.
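To make the shape of the code concrete, here is a heavily simplified sketch of the structure described above. It is not my actual code: expand, process_all and StringSet are made-up names, the recursion here only ever inserts a single string, and it assumes each loop iteration writes to its own set in the vector, so no locking is shown.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::set<std::string> StringSet;

// Hypothetical recursive worker: consumes `rest` one character at a time with
// substr(1), grows `prefix` with + char, and makes two recursive calls.
void expand(const std::string& prefix, const std::string& rest, StringSet& out)
{
    if (rest.empty()) {
        out.insert(prefix);                   // std::set drops duplicates
        return;
    }
    const std::string tail = rest.substr(1);  // creates a new string each call
    expand(prefix + rest[0], tail, out);      // keep the first character
    expand(prefix, tail, out);                // or skip it
}

// One parallel for over all inputs. Assumption: iteration i touches only
// results[i], so the iterations share no set and need no synchronisation.
std::vector<StringSet> process_all(const std::vector<std::string>& inputs)
{
    std::vector<StringSet> results(inputs.size());
    const long n = static_cast<long>(inputs.size());

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        expand(std::string(), inputs[i], results[i]);
    }
    return results;
}
```

Calling process_all on a vector of about 2k strings corresponds to the single parallel loop. Without -fopenmp the pragma is ignored and the same code runs serially, which matches the two builds being compared.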
Everything runs correctly and well within the spec, but I'm trying to see if it can run faster. :-)