I'm profiling the push/pop operations on a priority_queue in my C++ code. I'm using both perf record and std::chrono::high_resolution_clock::now() to measure the time spent in priority_queue operations (push, pop, top). The results are vastly different: perf attributes 22% of the run-time to std::__adjust_heap, while the clock-based measurement puts the total heap operations at around 0.5% of my total run-time. My first hunch was that the instructions were being reordered, so the timer was not actually bracketing the critical section of the code.
Here is the relevant piece of code:
std::priority_queue<Packet *> exec_pkts_;
std::chrono::microseconds heap_time_{0}; // zero-initialize: a default-constructed duration holds an indeterminate value
Packet * pkt;
//......Some other code......
#ifdef HEAP_PROFILE
auto start = std::chrono::high_resolution_clock::now();
#endif
exec_pkts_.push(pkt);
#ifdef HEAP_PROFILE
auto stop = std::chrono::high_resolution_clock::now();
heap_time_ += std::chrono::duration_cast<std::chrono::microseconds>(stop-start);
#endif
I use memory barriers, as described in Enforcing statement order in C++, to keep the timing calls from being reordered relative to the push.
I disassembled the binary with objdump -S -D, and looking at the assembly I see that callq 401290 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
appears correctly before and after the critical section of the code.
Is there something obvious I'm missing? What else could explain the drastic difference between the two sets of numbers?