
I want to investigate how much faster a task can be completed with multithreading than with a single thread, depending on the task size.

I plot a graph showing:

  • x-axis: how long the task takes on a single thread.
  • y-axis: how much faster the same task is done on two threads.

What I would expect to happen:

  • If the task gets longer, the overhead of creating the threads becomes less important, so the ratio (t_single / t_multi) increases.
  • Since I use two threads, I would expect the ratio (t_single / t_multi) to converge to 2 (two threads => twice as fast as one thread).

What I get:

  • a peak at a single-thread task time of about 10e-2 sec
  • the peak ratio is about 2.5 (multithreading 2.5 times faster than single threading)

How can this be explained?

The plot shows the mean over 10 measurements. I am running on a 24-core Linux machine.

[plot: two-thread speedup (t_single / t_multi) vs. single-thread task time]

CODE:

#include <string>
#include <iostream>
#include <thread>
#include <vector>
#include <ctime>
#include <cmath>
#include <chrono>

using namespace std;
using namespace std::chrono;

// sums the integers 0 .. number-1 (busy work used as the benchmark task)
void task(int number)
{
    int s = 0;
    for(int i=0; i<number; i++){
        s = s + i;
    }
    // cout << "the sum is " << s << endl;
}

double get_time_single(int m){

    // init
    int n_threads = 2;
    int n = pow(10, m);

    high_resolution_clock::time_point start = high_resolution_clock::now();

    for(int jobs = 0; jobs < n_threads; jobs++){
        task(n);
    }

    high_resolution_clock::time_point end = high_resolution_clock::now();
    double time_single = duration<double, std::milli>(end - start).count();

    return time_single;
}

double get_time_multi(int m){

    // init
    int n_threads = 2;
    int n = pow(10, m);
    vector<thread> threads;

    high_resolution_clock::time_point start = high_resolution_clock::now();

    // execute threads
    for( int i = 0; i < n_threads; i++ ){
        threads.push_back(thread(task, n));
    }

    // join threads
    for( int i = 0; i < n_threads; i++ ){
        threads.at(i).join();
    }

    high_resolution_clock::time_point end = high_resolution_clock::now();
    double time_multi = duration<double, std::milli>(end - start).count();

    return time_multi;
}


int main()
{

    // print header of magnitude - multi-proc-time - single-proc-time table
    cout << "mag" << "\t" << "time multi" << "  \t" << "time single" << endl;
    cout << "-------------------------------------" << endl;

    // iterate through different task magnitudes
    for(int m = 3; m<10; m++){

        double t_single = 0;
        double t_multi  = 0;

        // get the mean over 10 runs
        for(int i = 0; i < 10; i++){
            t_multi = t_multi + get_time_multi(m);
            t_single = t_single + get_time_single(m);
        }

        t_multi = t_multi / 10;
        t_single = t_single / 10;

        cout << m << "\t" << t_multi << "  \t" << t_single << endl;

    }
}

OUTPUT:

mag     time multi      time single
-------------------------------------
3       0.133946        0.0082684
4       0.0666891       0.0393378
5       0.30651         0.681517
6       1.92084         5.19607
7       18.8701         41.1431
8       195.002         381.745
9       1866.32         3606.08
  • The answer is, as usual: don't measure the runtime of unoptimized builds, because the results are meaningless. With optimization enabled, your `task` will compile to `void task(int){}`. So tl;dr: you are measuring nonsense; get a better example. – Baum mit Augen Jun 11 '17 at 13:26
  • I disabled compiler optimization with the g++ flag `-O0` (as suggested [here](https://stackoverflow.com/a/5765916/4391129)). I still get the same result. – Oliver Wilken Jun 11 '17 at 13:47
  • 1
    You are not supposed to disable optimization when benchmarking, the results are completely meaningless. No one cares about the performance of debug builds, including the people implementing the language. – Baum mit Augen Jun 11 '17 at 13:48
  • @BaummitAugen: My goal is to understand how C++ handles threads. If I disable compiler optimizations, I can make sure that the function `void task(int)` is completely executed every time it is called. Do you agree? If yes, the question of what is happening in the background remains. – Oliver Wilken Jun 11 '17 at 14:01
  • 1
    There is all kinds of stuff going on in debug builds. You are measuring noise and try to find patterns. That is a waste of time. – Baum mit Augen Jun 11 '17 at 14:05

1 Answer


So, you are getting peak MT performance when your task completes in about 5 ms? On Linux, the maximum scheduler time slice is sysctl_sched_latency, usually 6 ms, so that is probably related.

A couple more things about your setup.

When doing micro-benchmarks, people usually report the fastest value, not the average.

Also, it's a bad idea to code these outer loops in C++, because of CPU caches (both the data caches and the micro-op cache). Better to pass the parameters as command-line arguments, write a script that calls your app multiple times, and gather the results somewhere.

Update: in general, the ideal task time per thread is the maximum you can afford while keeping all CPU cores busy all the time and meeting your other requirements (such as latency).

Soonts