
I tried to write an answer to the question "How to get 100% CPU usage from a C program" using the `thread` class. Here is my code:

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>

using namespace std;

static int primes = 0;
void prime(int a, int b);
mutex mtx;

int main() 
{
  unsigned int nthreads = thread::hardware_concurrency();
  vector<thread> threads;
  int limit = 1000000;
  int intrvl = (int) limit / nthreads;

  for (int i = 0; i < nthreads; i++)
  {
      threads.emplace_back(prime, i*intrvl+1, i*intrvl+intrvl);
  }

  cout << "Number of logical cores: " << nthreads << "\n";
  cout << "Calculating number of primes less than " << limit << "... \n";

  for (thread & t : threads) {
    t.join();
  }

  cout << "There are " << primes << " prime numbers less than " << limit << ".\n";

  return 0;
}

void prime(int a, int b) 
{
    for (a; a <= b; a++) { 
        int i = 2; 
        while(i <= a) { 
            if(a % i == 0)
                break;
            i++; 
        }
        if(i == a) {
            mtx.lock();
            primes++;
            mtx.unlock();
        }
    }
}

But when I run it I get the following diagram

[CPU usage graph: utilization oscillates like a sinusoid]

That is a sinusoid. But when I run @Mysticial's answer, which uses OpenMP, I get this:

[CPU usage graph: steady, near-100% utilization]

I checked both programs with `ps -eLf` and both of them use 8 threads. Why do I get this unsteady diagram, and how can I get the same result with `thread` that OpenMP achieves?

Dante
  • What are the units on the horizontal scale? – Warren Dew Jun 05 '16 at 01:17
  • Hey Dante, You might have to set cpu affinity your self to optimize your program for max performance. Take a look at http://linux.die.net/man/2/sched_setaffinity . But I would assume I should have been done by the OS if no other process is consuming CPU time as it is on your first graph. – printfmyname Jun 05 '16 at 01:22
  • The only guaranteed way to achieve 100% CPU utilization is to avoid using any system calls. As soon as you make a system call, all bets are off, since the system call may result in the thread getting paused, for arbitrary reason. And here, "system call" also includes your mutex operations. – Sam Varshavchik Jun 05 '16 at 01:24
  • @SamVarshavchik But system calls didn't affect `openmp`! – Dante Jun 05 '16 at 01:32
  • It probably isn't that important, but I think there is a bug in this program which will cause it to undercount the number of primes. The issue is if limit is not divisible by the number of threads, you will miss some primes as you only count up to threads * floor(limit / threads). (The floor coming from the integer division). –  Jun 05 '16 at 01:34
  • @Lalaland But the problem exists even if limit is divisible by the number of threads. – Dante Jun 05 '16 at 01:43

3 Answers


There are some fundamental differences between Mysticial's answer and your code.


Difference #1

Your code creates a chunk of work for each CPU, and lets it run to completion. This means that once a thread has finished, there will be a sharp drop in the CPU usage since a CPU will be idle while the other threads run to completion. This happens because scheduling is not always fair. One thread may progress, and finish, much faster than the others.

The OpenMP solution solves this by declaring schedule(dynamic) which tells OpenMP to, internally, create a work queue that all the threads will consume work from. When a chunk of work is finished, the thread that would have then exited in your code consumes another chunk of work and gets busy with it.

Eventually, this becomes a balancing act of picking adequately sized chunks. Too large, and the CPUs may not be maxed out toward the end of the task. Too small, and there can be significant overhead.

Difference #2

You are writing to a variable, `primes`, that is shared between all of the threads. This has two consequences:

  • It requires synchronization to prevent a data race.
  • It makes the cache on a modern CPU very unhappy, since a cache flush is required before writes on one thread are visible to another thread.

The OpenMP solution solves this by reducing, via operator+(), the result of the individual values of primes each thread held into the final result. This is what reduction(+ : primes) does.

With this knowledge of how OpenMP is splitting up, scheduling the work, and combining the results, we can modify your code to behave similarly.


#include <iostream>
#include <thread>
#include <vector>
#include <utility>
#include <algorithm>
#include <functional>
#include <mutex>
#include <future>

using namespace std;

int prime(int a, int b)
{
    int primes = 0;
    for (; a <= b; a++) {
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            primes++;
        }
    }
    return primes;
}


int workConsumingPrime(vector<pair<int, int>>& workQueue, mutex& workMutex)
{
    int primes = 0;
    unique_lock<mutex> workLock(workMutex);
    while (!workQueue.empty()) {
        pair<int, int> work = workQueue.back();
        workQueue.pop_back();

        workLock.unlock(); //< Don't hold the mutex while we do our work.
        primes += prime(work.first, work.second);
        workLock.lock();
    }
    return primes;
}


int main()
{
    int nthreads = thread::hardware_concurrency();
    int limit = 1000000;

    // A place to put work to be consumed, and a synchronisation object to protect it.
    vector<pair<int, int>> workQueue;
    mutex workMutex;

    // Put all of the ranges into a queue for the threads to consume.
    int chunkSize = max(limit / (nthreads*16), 10); //< Handwaving in picking 16 as a factor; chunks just need to be neither too large nor too small.
    for (int i = 0; i < limit; i += chunkSize) {
        workQueue.push_back(make_pair(i, min(limit, i + chunkSize) - 1)); //< Inclusive upper bound; the -1 avoids counting chunk boundaries twice.
    }

    // Start the threads.
    vector<future<int>> futures;
    for (int i = 0; i < nthreads; ++i) {
        packaged_task<int()> task(bind(workConsumingPrime, ref(workQueue), ref(workMutex)));
        futures.push_back(task.get_future());
        thread(move(task)).detach();
    }

    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    // Sum up all the results.
    int primes = 0;
    for (future<int>& f : futures) {
        primes += f.get();
    }

    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
}

This is still not a perfect reproduction of how the OpenMP example behaves. For example, this is closer to OpenMP's static schedule, since the chunks of work are a fixed size. Also, OpenMP does not use a work queue at all. So I may have lied a little bit -- call it a white lie, since I wanted to be more explicit about showing the work being split up. What OpenMP is likely doing behind the scenes is storing the iteration that the next thread should start at when it becomes available, plus a heuristic for the next chunk size.

Even with these differences, I'm able to max out all my CPUs for an extended period of time.

[Command-line CPU usage graph: all cores pinned at 100%]


Looking to the future...

You probably noticed that the OpenMP version is a lot more readable. This is because it's meant to solve problems just like this. So, when we try to solve them without a library or compiler extension, we end up reinventing the wheel. Luckily, there is a lot of work being done to bring this sort of functionality directly into C++. Specifically, the Parallelism TS can help us out if we could represent this as a standard C++ algorithm. Then we could tell the library to distribute the algorithm across all CPUs as it sees fit so it does all the heavy lifting for us.

In C++11, with a little bit of help from Boost, this algorithm could be written as:

#include <iostream>
#include <iterator>
#include <algorithm>
#include <boost/range/irange.hpp>

using namespace std;

bool isPrime(int n)
{
    if (n < 2)
        return false;

    for (int i = 2; i < n; ++i) {
        if (n % i == 0)
            return false;
    }
    return true;
}

int main()
{
    auto range = boost::irange(0, 1000001);
    auto numPrimes = count_if(begin(range), end(range), isPrime);
    cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}

And to parallelise the algorithm, you just need to #include <execution_policy> and pass std::par as the first parameter to count_if. (This has since been standardized in C++17 as the <execution> header and the std::execution::par policy.)

auto numPrimes = count_if(par, begin(range), end(range), isPrime);

And that's the kind of code that makes me happy to read.

Note: Absolutely no time was spent optimising this algorithm at all. If we were to do any optimisation, I'd look into something like the Sieve of Eratosthenes, which uses previous prime computations to help with future ones.

Sean Cline

First, you need to realize that OpenMP usually has a fairly sophisticated thread pool under the covers, so matching it (exactly) will probably be at least somewhat difficult.

Second, it seems to me that before optimizing the threading, you should attempt to start with at least a halfway decent basic algorithm. In this case, the basic algorithm you're implementing is pretty awful: it checks whether numbers are prime, but does a lot of work that doesn't accomplish anything useful.

  1. It's checking whether even numbers are prime. Other than 2, they're not. Ever.
  2. It's checking whether odd numbers are divisible by even numbers. Again, they're not. Ever.
  3. It's checking whether numbers are divisible by numbers larger than their square root. If there's no divisor smaller than the square root, there can't be one larger than the square root either.

Although it probably doesn't affect speed, I also find it a lot easier to have a function that checks whether a single number is prime, and just returns true/false to indicate the result, than to have somewhat elaborate code to figure out whether a preceding loop ran to completion or exited early.

You can optimize the algorithm by eliminating more than that, but that much doesn't strike me as "optimization" nearly so much as simply avoiding completely unnecessary pessimization.

At least in my opinion, it's also a bit easier (in this case) to use std::async to launch the threads. This lets us return a value from our thread (the count we want back) pretty easily.

So, let's start by fixing prime based on those observations:

int prime(int a, int b)
{
    int count = 0;

    // 2 is the only even prime; count it if it falls inside [a, b].
    if (a <= 2 && b >= 2)
        ++count;

    // Start at the first odd number that is at least max(a, 3), so that
    // neither 1 nor any even number is ever tested below.
    if (a < 3)
        a = 3;
    else if (a % 2 == 0)
        ++a;

    auto check = [](int i) -> bool {
        for (int j = 3; j*j <= i; j += 2)
            if (i % j == 0)
                return false;
        return true;
    };

    for (; a <= b; a += 2) {
        if (check(a))
            ++count;
    }
    return count;
}

Now, let me point out that this is already enough faster (even single-threaded) that if we just wanted to get the job to finish 4 times faster (or so) that we'd get from perfect thread-scaling, we're done, even without using threading at all. For the limit you gave, this finishes in well under 1 second.

For the sake of argument, however, let's assume we want more, and want to make use of multiple cores too. One thing to realize here is that we generally want at least a few more threads than cores. The problem is fairly simple: with only one thread per core, nothing makes up for the fact that the load isn't distributed evenly between the threads--the thread processing the largest numbers has quite a bit more work to do than the thread processing the smallest numbers--so if we have (for example) a 4-core machine, as soon as one thread finishes, we can only use 75% of the CPU. Then when another thread finishes, it drops to 50%. Then 25%, and finally it finishes, using only one core.

We could probably do some computation to attempt to distribute the load more evenly, but it's a lot easier to just split the load into, say, six or eight times as many threads as cores. This way the computation can continue using all the cores until there are only three threads remaining.[1]

Putting all that into code, we can end up with something like this:

#include <iostream>
#include <vector>
#include <future>
#include <thread>
#include <chrono>

using namespace std;

int main() {
    using namespace chrono;

    int limit = 50000000;
    unsigned int nthreads = 8 * thread::hardware_concurrency();

    cout << "\nComputing multi-threaded:\n";
    cout << "Number of threads: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    auto start2 = high_resolution_clock::now();

    vector<future<int>> threads;

    int intrvl = limit / nthreads;

    for (unsigned int i = 0; i < nthreads; i++)
        threads.emplace_back(std::async(std::launch::async, prime,
            i*intrvl + 1,
            i == nthreads - 1 ? limit : (i + 1)*intrvl)); //< The last chunk absorbs any remainder when limit isn't divisible by nthreads.

    int primes = 0;
    for (auto &t : threads)
        primes += t.get();

    auto end2 = high_resolution_clock::now();

    cout << "Primes: " << primes << ", Time: " << duration_cast<milliseconds>(end2 - start2).count() << "\n";
}

Note a couple of points:

  1. This runs enough faster that I've increased the upper limit by a fairly large factor, so it'll run long enough that we can at least see it use 100% of the CPU time for a few seconds before it's done.[2]
  2. I've added some timing code to get a little more accurate idea of how long it runs for.

At least when I run this, it seems to act about as we'd expect/hope: it uses 100% of the CPU time until it gets very close to the end, when it starts to drop just before finishing (i.e., when we have fewer threads to execute than we have cores to execute them).

[CPU usage graph: 100% utilization until just before the end]


  [1] In case you wonder how OpenMP avoids this: it usually uses a thread pool, so some number of iterations of the loop is dispatched to the thread pool as a task. This lets it produce a large number of tasks without having a huge number of threads contending for CPU time simultaneously.
  [2] With the upper limit you used, it finished on my machine in about 90 milliseconds, which isn't long enough for it to even make a noticeable blip on the CPU usage graph.
Jerry Coffin

The OpenMP example is using "reduction" on the sum variable primes, which means each thread sums into its own local primes variable. OpenMP adds the thread-local copies of primes together at the end of the parallel part to get the grand total. That means it does not need to lock. As @Sam says, a thread will be put to sleep if it cannot acquire the mutex. So in your case, the threads will spend a fair amount of time asleep. If you don't want to use OpenMP, try static std::atomic<int> primes{0}; then you don't need the mutex lock and unlock.

Or you could simulate OpenMP reduction by using an array primes[numThreads] where thread i sums into primes[i] then sum primes[] at the end.

John D