
This question is related to:

c++ std::async : faster on 4 cores compared to 8 cores

In the previous question, I was wondering why some code would run faster on 4 cores than on 8 (answer: my CPU has 4 cores and 8 threads).

Now I am finding that the code runs at more or less the same speed regardless of the number of cores used.

I am on Ubuntu 16.06, using C++11. Intel® Core™ i7-8550U CPU @ 1.80GHz × 8

Here is the code for benchmarking computation time against the number of cores used:

#include <math.h>
#include <future>
#include <ctime>
#include <vector>
#include <iostream>

#define NB_JOBS 2000.0
#define MAX_CORES 8

// no special meaning to this function, 
// just uses some CPU
static bool _expensive(int nb_jobs){
  for(int job=0;job<nb_jobs;job++){
    float x = 0.6;
    bool b = true;
    double f = 1;
    for(int i=0;i<1000;i++){
      if(!b) f=-1;
      for(double j=1;j<2.0;j+=0.01) x+= f* pow(1.0/sin(x),j);
      b = !b;
    }
  }
  return true;
}

static double _duration(int nb_cores){

  std::clock_t begin = clock();

  int nb_jobs_per_core = rint ( NB_JOBS / (float)nb_cores );

  std::vector < std::future<bool> > futures;
  for(int i=0;i<nb_cores;i++){
    futures.push_back( std::async(std::launch::async,_expensive,nb_jobs_per_core));
  }
  for (auto &e: futures) {
    bool foo = e.get();
  }

  std::clock_t end = clock();

  double duration = double(end - begin) / CLOCKS_PER_SEC;
  return duration;

}


int main(){

  for(int nb_cores=1 ; nb_cores<=MAX_CORES ; nb_cores++){

    double duration = _duration(nb_cores);
    std::cout << nb_cores << " threads: " << duration << "\n";

  }

  return 0;

}

Here is the output:

1 threads: 8.55817
2 threads: 8.76621
3 threads: 7.90191
4 threads: 8.4656
5 threads: 10.5494
6 threads: 11.6175
7 threads: 21.697
8 threads: 24.3621

Using more cores seems to have only a marginal impact.

What troubles me is that the CPU has 4 cores. So I was expecting the program to run (around) 4 times faster when using 4 threads. It does not.

Note: "htop" shows usage of virtual cores as expected by the program, i.e. first one core used at 100%, then 2, ..., and at the end 8.

If I replace:

futures.push_back( std::async(std::launch::async,[...]

with:

futures.push_back( std::async(std::launch::async|std::launch::deferred,[...]

then I get:

1 threads: 8.6459
2 threads: 8.69905
3 threads: 10.7763
4 threads: 11.4505
5 threads: 11.8426
6 threads: 10.4282
7 threads: 9.55181
8 threads: 9.05565

and htop shows only 1 virtual core being used 100% during the full duration.

Am I doing anything wrong?

Note: I tried this on several desktops with various specs (number of cores and number of threads) and observed something similar.

Vince
  • You are not measuring the correct time. `clock` returns CPU time, not wall time. Time spent on separate cores is added up, so you may get more seconds than are actually spent according to a wall clock. – n. m. could be an AI Dec 28 '17 at 10:10
  • Possible duplicate of [clock function in C++ with threads](https://stackoverflow.com/questions/35592502/clock-function-in-c-with-threads) – n. m. could be an AI Dec 28 '17 at 10:11
  • @n.m. "You are not measuring the correct time. clock returns CPU time, not wall time": yes, but the relative time (in seconds) matches (e.g. it does take the same time if I use 1 or 2 cores) – Vince Dec 28 '17 at 10:42
  • @n.m. why would this be a duplicate ? I have no issue/question about clock – Vince Dec 28 '17 at 10:48
  • Same amount of work takes same amount of CPU time regardless of the number of cores used. – n. m. could be an AI Dec 28 '17 at 10:50
  • @n.m. indeed, I am not sure what clock shows, but in practice, the number shown did correspond to the number of seconds required for the program to finish, i.e. it did take 3 times longer using 8 cores, and when "launch::deferred" was used, it did take more or less the same time (measured in seconds) for the program to finish regardless of the number of cores used – Vince Dec 28 '17 at 10:59
  • "the number shown did correspond to the number of seconds required for the program to finish" Did you measure with a stopwatch? – n. m. could be an AI Dec 28 '17 at 11:03
  • @n.m. yes. I will update the post with this info to avoid the confusion. Thanks for pointing this out. – Vince Dec 28 '17 at 11:05
  • In my experience your program with `std::launch::async` runs for a total of 10 seconds while printing 30 seconds of CPU time. With `std::launch::deferred` it runs for 22 seconds while printing 22 seconds of CPU time. This is more or less as expected. I don't understand what troubles you. – n. m. could be an AI Dec 28 '17 at 11:12
  • @n.m. I have 4 cores. I was hoping to have my program running (around) 4 times faster using 4 threads. It does not. It runs for the same duration on 1, 2, 3 or 4 threads. – Vince Dec 28 '17 at 11:17
  • "It runs for the same duration on 1, 2, 3 or 4 threads" I don't see any evidence of that. Please take a look [here](http://coliru.stacked-crooked.com/a/7c7cefe971ecafdf). It's your program modified to print both CPU times and wall clock times. Also the `time` command is used to show the same times. Do you have a question regarding the numbers it produces? Is behaviour on your system similar or radically different (e.g. CPU time and elapsed are very similar)? – n. m. could be an AI Dec 28 '17 at 11:38
  • @n.m. I fear we are not talking about the same thing. I run one thread, it takes n seconds. I run 4 threads, it takes the same n seconds. Why doesn't it take n/4 seconds? "I don't see any evidence of that": this is what my stopwatch is telling me. Using "clock" was not a good idea; I can change the code to use "std::chrono::system_clock::now()" to avoid confusion, but based on what my iPhone stopwatch is telling me, this will not change the question I have. – Vince Dec 28 '17 at 11:55
  • Stopwatch will tell the true story, and so will the `time` command. Have you tried the `time` command? Please take a look at the link I have sent you. Run your program using `time`. Show the output, for both `std::launch::async` and `std::launch::deferred` versions. Lines printed by your program are expected to be roughly the same, it's completely normal, that's what `std::clock` is expected to show, so please don't base your conclusions on these lines. – n. m. could be an AI Dec 28 '17 at 12:13
  • Please do answer questions I'm asking, I'm trying to troubleshoot your problem and I cannot do so without some input from you. – n. m. could be an AI Dec 28 '17 at 12:16
  • @n.m. ooooooh, my bad. I do not know how I managed to get so confused, but indeed using time instead of clock shows the "expected" result. I guess I totally messed up with my stopwatch yesterday (did I forget to recompile after a modification?). I sincerely apologize for this stupid mistake and for wasting your time. Thanks for your patience. – Vince Dec 28 '17 at 12:33
  • @n.m. ok, I am not completely crazy, what happened is that the issue was real with the complex program I was using, but disappeared when simplifying for this post. https://stackoverflow.com/questions/48010533/c-async-how-to-shuffle-a-vector-in-multithread-context – Vince Dec 28 '17 at 15:59
