TLDR: You can get a pretty good idea about hotspots with millisecond resolution, but nanosecond resolution doesn't work for various reasons.
You can probably find or write some function that gives you the best resolution your computer can provide, but even that still doesn't give you meaningful results:
auto start = getBestPrecisionTime();
foo();
auto end = getBestPrecisionTime();
std::cout << "foo took " << to_nanoseconds(end - start) << "ns";
The first issue is that foo() gets interrupted by another program, so you are not actually measuring foo() but foo() + some_random_service. One way around that is to make 1000 measurements, hope that at least one of them wasn't interrupted, and take the minimum. Depending on how long foo() actually takes, your chances of that are anywhere from always to never.
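A minimal sketch of that repeat-and-take-the-minimum idea could look like this (foo() and the repetition count of 1000 are placeholders, and std::chrono::steady_clock stands in for getBestPrecisionTime):
#include <algorithm>
#include <chrono>

void foo();  // the function under test

std::chrono::nanoseconds measure_foo()
{
    auto best = std::chrono::nanoseconds::max();
    for (int i = 0; i < 1000; ++i)
    {
        auto start = std::chrono::steady_clock::now();
        foo();
        auto end = std::chrono::steady_clock::now();
        // keep the fastest run, hoping it is one that was not interrupted
        best = std::min(best, std::chrono::duration_cast<std::chrono::nanoseconds>(end - start));
    }
    return best;
}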
Similarly, foo() probably accesses memory that sits somewhere in the level 1/2/3/4 caches, in RAM, or on the hard drive, so again you are measuring the wrong thing. You would need real-world data on how likely it is that the memory foo() needs is in each of those places and what the access times are.
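You can see the effect by timing two back-to-back calls: the first often runs with cold caches, the second with warm ones, and neither number is necessarily representative of your real workload (this reuses the same getBestPrecisionTime/to_nanoseconds placeholders as above):
auto t0 = getBestPrecisionTime();
foo();                              // data likely cold: RAM or even disk
auto t1 = getBestPrecisionTime();
foo();                              // data likely warm: L1/L2 cache
auto t2 = getBestPrecisionTime();
std::cout << "cold: " << to_nanoseconds(t1 - t0) << "ns, "
          << "warm: " << to_nanoseconds(t2 - t1) << "ns";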
Another major issue is optimization. It doesn't make much sense to measure the performance of a debug build, so you will want to measure with maximum optimization enabled. At a high optimization level the compiler will reorder and inline code. The getBestPrecisionTime function then has two options: allow the compiler to move code past it, or not. If it allows reordering, the compiler will do this:
foo();
auto start = getBestPrecisionTime();
auto end = getBestPrecisionTime();
std::cout << "foo took " << to_nanoseconds(end - start) << "ns";
and then optimize it further to
std::cout << "foo took 0ns";
Obviously this produces wrong results, so all timing functions I have come across add barriers to disallow such reordering.
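With GCC or Clang, for example, such a barrier can be an empty asm statement with a "memory" clobber; this is only a sketch of the idea and compiler-specific, not something the standard guarantees:
auto start = getBestPrecisionTime();
asm volatile("" ::: "memory");  // compiler may not move memory accesses across this
foo();
asm volatile("" ::: "memory");
auto end = getBestPrecisionTime();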
But the alternative is not much better. Without the measurement the compiler may optimize this
foo();
bar();
into
code_that_does_foo_bar;
which is more efficient due to better utilization of registers/SIMD instructions/caching/.... But once you insert the measurement you have disabled this optimization, and you are measuring the wrong version. With a lot of work you may be able to figure out which assembler instructions inside code_that_does_foo_bar originated from foo(), but since you can't even tell exactly how long a single assembler instruction takes, and that time also depends on the surrounding instructions, you have no chance of getting an accurate number for optimized code.
The best you can do is just use std::chrono::high_resolution_clock, because it simply doesn't get much more precise than that.