
Suppose I have lots of computations that I want to run (and benchmark CPU time) in multiple threads. As a toy example:

#include <chrono>
#include <future>
#include <iostream>
#include <vector>


using unit_t = std::chrono::nanoseconds;

unit_t::rep expensive_computation() {
    auto start = std::chrono::steady_clock::now();
    // Something time-consuming here...
    auto end = std::chrono::steady_clock::now();

    auto duration = std::chrono::duration_cast<unit_t>(end - start).count();

    return duration;
}

int main() {
    std::vector<std::future<unit_t::rep>> computations;

    for (int i = 0; i < 100; i++) {
        // Request a real thread: the default launch policy may defer
        // execution to the get() call instead of running concurrently.
        computations.push_back(std::async(std::launch::async, expensive_computation));
    }

    for (size_t i = 0; i < computations.size(); i++) {
        auto duration = computations[i].get();
        std::cout << "#" << i << " took " << duration << "ns" << std::endl;
    }
}

I'm concerned that since steady_clock is monotonic across threads, the underlying clock ticks per process and not per thread (if any thread is scheduled, the clock ticks for all threads). This would mean that if a thread were sleeping, steady_clock would still be ticking for it, and this time would incorrectly be included in that thread's duration. Is my suspicion correct? Or does steady_clock tick only for CPU time within a thread?

Put another way, is this approach a safe way to independently time lots of computations (such that no CPU time spent on one thread will affect the duration of another thread)? Or do I need to spin off separate processes for each computation to make the steady_clock only tick when the computation is running/scheduled?
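
For reference, this is a sketch of what per-thread CPU timing would look like instead (assuming a POSIX system, where `CLOCK_THREAD_CPUTIME_ID` is available; this is not portable, and `thread_cpu_ns` is just an illustrative helper name):

```cpp
#include <cstdint>
#include <ctime>

// CPU time consumed by the calling thread, in nanoseconds (POSIX-only).
// Time spent sleeping or descheduled is NOT counted, unlike steady_clock.
std::int64_t thread_cpu_ns() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1'000'000'000 + ts.tv_nsec;
}

std::int64_t timed_computation() {
    auto start = thread_cpu_ns();
    // Something time-consuming here...
    auto end = thread_cpu_ns();
    return end - start;  // CPU time of this thread only
}
```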

edit: I also recognize that spinning up more threads than cores may be an inefficient approach to this problem (although I don't particularly care about computation throughput; I just want them all, as a group, to complete in the fastest time). I suspect in practice I'd need to maintain a small, bounded set of threads in flight (say, capped at the number of cores) and only start new computations as a core becomes available. But this shouldn't have an impact on the timing I care about above; it should only affect the wall-clock time.
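
A minimal sketch of that bounded approach (`run_bounded` is an illustrative name, not a library function), which simply runs tasks in batches of at most `hardware_concurrency()`:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <type_traits>
#include <vector>

// Run `total` copies of `task`, keeping at most one batch of
// hardware_concurrency() tasks in flight at a time.
template <typename F>
std::vector<std::invoke_result_t<F>> run_bounded(F task, std::size_t total) {
    const std::size_t cap = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::invoke_result_t<F>> results;
    results.reserve(total);

    std::size_t done = 0;
    while (done < total) {
        std::vector<std::future<std::invoke_result_t<F>>> batch;
        for (std::size_t i = 0; i < cap && done + i < total; ++i)
            batch.push_back(std::async(std::launch::async, task));
        for (auto& f : batch)
            results.push_back(f.get());  // collect this batch before starting more
        done += batch.size();
    }
    return results;
}
```

Batching is cruder than refilling a slot as each task finishes, but it keeps the in-flight count bounded without any extra synchronization.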

Bailey Parker
  • I think you should measure on a 100% idle machine. In this case it doesn't really matter which clock you use (if you measure things on a loaded machine, your benchmark could be inaccurate, even if you measure thread time: HyperThreading, cache usage, etc. could affect results). – geza Aug 26 '18 at 15:26
  • A future-ready check before the future.get() would improve the code. At ns resolution, which is also in the CPU's domain, checking the system's minimum clock resolution may be essential. – seccpur Aug 26 '18 at 15:27
  • @geza This is a very good point. I'm currently working to secure an environment like that. One of the concerns I'd have even on a 100% idle machine is that some computations take significantly longer than others (and those would perhaps be more impacted by the load, potentially artificially increasing their time). But I suspect this is largely unavoidable. – Bailey Parker Aug 26 '18 at 15:37
  • @seccpur What do you mean by a "future ready check?" Does `future.get()` not call `future.wait()`? In the toy example, I'm not particularly concerned with the latency of results (just the overall runtime). Also, the ns resolution was an artifact of this being a toy example. My real benchmark resolution will probably be μs or ms. – Bailey Parker Aug 26 '18 at 15:39
  • Say one of the computations is an exceptionally long one; if future.get() is called before the computation is over, that will be UB. – seccpur Aug 26 '18 at 15:43
  • @BaileyParker: "One of the concerns I'd have even on a 100% idle machine is that some computations take significantly longer than others". What do you mean by this? (just a note: remember to turn off CPU frequency scaling). – geza Aug 26 '18 at 15:49
  • @geza The longer computations (say at least an order of magnitude or two longer) would then be exposed to the effects of hyperthreading, caching etc. for longer and this would artificially inflate their runtime by some overhead that is a function of their true runtime. So, short and long running computations would still have comparable runtimes, but it would be less meaningful to say that a short was 100x faster, because when run serially that may not be the case. At least, that's my intuition. – Bailey Parker Aug 26 '18 at 15:54
  • @BaileyParker: hmm. If you measure two operations in the same environment, then it is fine to say that one is 100x faster, even if there are "cache effects". This cannot be avoided. What I meant by cache is that if your machine is under load, then a task switch can clear the cache. So when your thread gets run again, then it will have to reload the cache. If there was no load, then your thread doesn't get interrupted, so the cache will be always "warm". HyperThreading can affect measurement as well: it matters a lot that your thread gets a "full" core, or just a "HyperThreaded" one. – geza Aug 26 '18 at 16:04

2 Answers


The standard specifies that steady_clock model physical time (as opposed to CPU time).

From [time.clock.steady]:

Objects of class steady_clock represent clocks for which values of time_point never decrease as physical time advances and for which values of time_point advance at a steady rate relative to real time. That is, the clock may not be adjusted.

That being said, how well an implementation models physical time is a QOI issue. Nevertheless, your code looks fine to me.

Should your experimentations prove unsatisfactory, clients of <chrono> can also author their own custom clocks that will have first class status within the <chrono> library.
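
For illustration, a minimal sketch of such a custom clock, assuming a POSIX system with `CLOCK_THREAD_CPUTIME_ID` (the name `thread_cpu_clock` is hypothetical, not part of `<chrono>`):

```cpp
#include <chrono>
#include <ctime>

// A custom <chrono> clock measuring the calling thread's CPU time
// (POSIX-only sketch; satisfies the Clock requirements so it works
// with duration_cast, time_point arithmetic, etc.).
struct thread_cpu_clock {
    using duration   = std::chrono::nanoseconds;
    using rep        = duration::rep;
    using period     = duration::period;
    using time_point = std::chrono::time_point<thread_cpu_clock>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        timespec ts{};
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        return time_point(duration(
            static_cast<rep>(ts.tv_sec) * 1'000'000'000 + ts.tv_nsec));
    }
};
```

Such a clock drops in wherever steady_clock was used in the question's code: call `thread_cpu_clock::now()` at the start and end of the computation and subtract.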

Howard Hinnant
  • Oops, seems like I misunderstood here then. I will try it out with the real computations and see if this is satisfactory (although determining whether there is interference will likely prove tricky). While researching, I did come across [`pthread_getcpuclockid`](http://man7.org/linux/man-pages/man3/pthread_getcpuclockid.3.html), so perhaps a custom clock wrapping that could be more accurate? As @geza notes in the comments though, my goal of having times equivalent to serial execution is likely unachievable given caching, hyperthreading, etc. Thanks for your insight! – Bailey Parker Aug 26 '18 at 15:33

This would mean that if a thread were sleeping, steady_clock would still be ticking for it and this time would incorrectly be included in the duration for that thread.

That won't be incorrect, though, as the standard specifies for class std::chrono::steady_clock that it measures physical time, not CPU time or any other time. See here under [time.clock.steady]:

Objects of class steady_clock represent clocks for which values of time_point never decrease as physical time advances and for which values of time_point advance at a steady rate relative to real time ...

That said, your code looks fine in that it will give you the time measured for each thread run. Is there a reason for you to want to measure CPU time here? If so then let me know in the comments.

Geezer
  • Thanks for the standard links. I will read up! Physical time perhaps is okay (in my specific case, computations do no I/O). My real goal here is that the timings produced should be equivalent to if all the computations had been run and timed serially. As long as this is the case, then that is fine. – Bailey Parker Aug 26 '18 at 15:26
  • @BaileyParker Yes, this is the case. Glad I could help. – Geezer Aug 26 '18 at 15:27
  • @BaileyParker Seeing from your comment to the other answer, I may not have fully understood your requirement for accuracy here. Your code in the post isn't equivalent *mathematically* to running the same tasks serially. Just equivalent in terms of measuring total time to execute all the tasks. – Geezer Aug 26 '18 at 16:05