
I want to investigate how much faster a task can be completed with multithreading than with a single thread, depending on the task size.

I plot a graph showing:

  • x-axis: how long the task takes on a single thread.
  • y-axis: how much faster the same task is done on two threads.

What I would expect to happen:

  • If the task gets longer, the overhead of creating the threads becomes less important, so the ratio (t_single / t_multi) increases.
  • Since I use two threads, I would expect the ratio (t_single / t_multi) to converge to 2 (two threads => twice as fast as one thread).

What I get:

  • a peak at a single-thread task time of about 10e-2 sec
  • the peak ratio is about 2.5 (multithreading 2.5 times faster than single threading)

How can this be explained?

The plot shows the mean over 10 measurements. I am running on a 24-core Linux machine.

[plot: two-thread speedup (t_single / t_multi) vs. single-thread task time]

CODE:

#include <string>
#include <iostream>
#include <thread>
#include <vector>
#include <ctime>
#include <cmath>
#include <chrono>

using namespace std;
using namespace std::chrono;

// sums the integers 0 .. number-1 (busy work used as the benchmark task)
void task(int number)
{
    int s = 0;
    for(int i=0; i<number; i++){
        s = s + i;
    }
    // cout << "the sum is " << s << endl;
}

double get_time_single(int m){

    // init
    int n_threads = 2;
    int n = pow(10, m);

    high_resolution_clock::time_point start = high_resolution_clock::now();

    for(int jobs = 0; jobs < n_threads; jobs++){
        task(n);
    }

    high_resolution_clock::time_point end = high_resolution_clock::now();
    double time_single = duration<double, std::milli>(end - start).count();

    return time_single;
}

double get_time_multi(int m){

    // init
    int n_threads = 2;
    int n = pow(10, m);
    vector<thread> threads;

    high_resolution_clock::time_point start = high_resolution_clock::now();

    // execute threads
    for( int i = 0; i < n_threads; i++ ){
        threads.push_back(thread(task, n));
    }

    // join threads
    for( int i = 0; i < n_threads; i++ ){
        threads.at(i).join();
    }

    high_resolution_clock::time_point end = high_resolution_clock::now();
    double time_multi = duration<double, std::milli>(end - start).count();

    return time_multi;
}


int main()
{

    // print header of magnitude - multi-proc-time - single-proc-time table
    cout << "mag" << "\t" << "time multi" << "  \t" << "time single" << endl;
    cout << "-------------------------------------" << endl;

    // iterate through different task magnitudes
    for(int m = 3; m<10; m++){

        double t_single = 0;
        double t_multi  = 0;

        // get the mean over 10 runs
        for(int i = 0; i < 10; i++){
            t_multi = t_multi + get_time_multi(m);
            t_single = t_single + get_time_single(m);
        }

        t_multi = t_multi / 10;
        t_single = t_single / 10;

        cout << m << "\t" << t_multi << "  \t" << t_single << endl;

    }
}

OUTPUT:

mag     time multi      time single
-------------------------------------
3       0.133946        0.0082684
4       0.0666891       0.0393378
5       0.30651         0.681517
6       1.92084         5.19607
7       18.8701         41.1431
8       195.002         381.745
9       1866.32         3606.08
  • The answer is, as usual: don't measure the runtime of unoptimized builds, because the results are meaningless. With optimization enabled, your `task` will compile to `void task(int){}`. So tl;dr: you are measuring nonsense; get a better example. – Baum mit Augen Jun 11 '17 at 13:26
  • I disabled compiler optimization with the g++ flag `-O0` (as suggested [here](https://stackoverflow.com/a/5765916/4391129)). I still get the same result. – Oliver Wilken Jun 11 '17 at 13:47
  • 1
    You are not supposed to disable optimization when benchmarking, the results are completely meaningless. No one cares about the performance of debug builds, including the people implementing the language. – Baum mit Augen Jun 11 '17 at 13:48
  • @BaummitAugen: My goal is to understand how C++ handles threads. If I disable compiler optimizations, I can make sure that the function `void task(int)` is completely executed every time it is called. Do you agree? If yes, the question of what is happening in the background remains. – Oliver Wilken Jun 11 '17 at 14:01
  • 1
    There is all kinds of stuff going on in debug builds. You are measuring noise and try to find patterns. That is a waste of time. – Baum mit Augen Jun 11 '17 at 14:05

1 Answer


So, you are getting peak MT performance when your task completes in about 5 ms? On Linux, the maximum scheduler time slice is sysctl_sched_latency, usually 6 ms, so that is probably related.

A couple more things about your setup.

When doing micro-benchmarks, people usually report the fastest value, not the average.

Also, it's a bad idea to code these outer loops in C++, because of CPU caches (both the data caches and the micro-op cache). Better to pass the parameters as command-line arguments, write a script that calls your app multiple times, and gather the results somewhere.

Update: in general, the ideal task time per thread is the maximum you can afford while keeping all CPU cores busy all the time and meeting your other requirements (such as latency).

Soonts