I'm profiling the push/pop operations on a priority_queue in my C++ code. I'm using both perf record and std::chrono::high_resolution_clock::now() to measure the time spent in priority_queue operations (push, pop, top). The results are vastly different: perf attributes 22% of the run-time to std::__adjust_heap, while the clock-based measurement puts the total heap operations at around 0.5% of my total run-time. My first hunch was that the instructions were being reordered, so the timer was not actually bracketing the critical section of the code.
Here is the relevant piece of code:
std::priority_queue<Packet *> exec_pkts_;
std::chrono::microseconds heap_time_{0}; // zero-initialize: a default-constructed duration holds an indeterminate value
Packet * pkt;
//......Some other code......
#ifdef HEAP_PROFILE
auto start = std::chrono::high_resolution_clock::now();
#endif
exec_pkts_.push(pkt);
#ifdef HEAP_PROFILE
auto stop = std::chrono::high_resolution_clock::now();
heap_time_ += std::chrono::duration_cast<std::chrono::microseconds>(stop-start);
#endif
I use memory barriers, as described in Enforcing statement order in C++, to keep the timing calls from being reordered relative to the push.
I disassembled the binary with objdump -S -D, and looking at the assembly I see that callq 401290 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
appears correctly before and after the critical section of the code.
Is there something obvious I'm missing? What else could explain the drastic difference between the two sets of numbers?