
Something faster than std::priority_queue as a min heap?


The original question is here. You can resolve the names generated by gprof with a demangler such as c++filt. With the help of the users there, I came to the conclusion that this block of code (which I expect to be executed far more often than I will perform a pop):

/**
 * Min_heap is a std::priority_queue with
 * std::greater as the comparator, i.e. a min-heap.
 */
typedef std::priority_queue<std::tuple<float, int, int>,
    std::vector<std::tuple<float, int, int> >,
    std::greater<std::tuple<float, int, int> > > Min_heap;
...
void nn(..) { // the core function of my project
  if(...) {
    // branch is a Min_heap; this push runs far more often than any pop
    branch.push(std::make_tuple(new_dist, other_child_i, tree_i));
  }
}

seems to be the bottleneck of my project (I think). To check, I ran:

gprof -q_ZN3RKDI24Division_Euclidean_spaceIfEE2nnEjRKSt6vectorIfSaIfEERKfRS3_ISt4pairIfiESaISB_EERiiPjRSt14priority_queueISt5tupleIJfiiEES3_ISJ_SaISJ_EESt7greaterISJ_EES9_RKjS7_S7_i geraf gmon.out > analysis.txt

and got this:

granularity: each sample hit covers 2 byte(s) for 0.01% of 125.47 seconds

index % time    self  children    called     name
                             1967195             _ZN3RKDI24Division_Euclidean_spaceIfEE2nnEjRKSt6vectorIfSaIfEERKfRS3_ISt4pairIfiESaISB_EERiiPjRSt14priority_queueISt5tupleIJfiiEES3_ISJ_SaISJ_EESt7greaterISJ_EES9_RKjS7_S7_i [2]
              105.54    0.09  320000/320000      Auto_random_kd_forest<float>::Auto_random_kd_forest(unsigned int&, unsigned int&, std::string const&, unsigned int, std::string const&, int, float, std::vector<std::vector<std::pair<float, int>, std::allocator<std::pair<float, int> > >, std::allocator<std::vector<std::pair<float, int>, std::allocator<std::pair<float, int> > > > >&, Params*, int) (1)
[2]     84.2  105.54    0.09  320000+1967195 _ZN3RKDI24Division_Euclidean_spaceIfEE2nnEjRKSt6vectorIfSaIfEERKfRS3_ISt4pairIfiESaISB_EERiiPjRSt14priority_queueISt5tupleIJfiiEES3_ISJ_SaISJ_EESt7greaterISJ_EES9_RKjS7_S7_i [2]
                0.08    0.00 1967195/2031195     void std::__push_heap<__gnu_cxx::__normal_iterator<std::tuple<float, int, int>*, std::vector<std::tuple<float, int, int>, std::allocator<std::tuple<float, int, int> > > >, int, std::tuple<float, int, int>, std::greater<std::tuple<float, int, int> > >(__gnu_cxx::__normal_iterator<std::tuple<float, int, int>*, std::vector<std::tuple<float, int, int>, std::allocator<std::tuple<float, int, int> > > >, int, int, std::tuple<float, int, int>, std::greater<std::tuple<float, int, int> >) [9]
                0.01    0.00   12000/12000       _ZNSt6vectorISt5tupleIJfiiEESaIS1_EE19_M_emplace_back_auxIJS1_EEEvDpOT_ [11]
                             1967195             _ZN3RKDI24Division_Euclidean_spaceIfEE2nnEjRKSt6vectorIfSaIfEERKfRS3_ISt4pairIfiESaISB_EERiiPjRSt14priority_queueISt5tupleIJfiiEES3_ISJ_SaISJ_EESt7greaterISJ_EES9_RKjS7_S7_i [2]
-----------------------------------------------
                0.00    0.00   64000/2031195     _ZSt13__adjust_heapIN9__gnu_cxx17__normal_iteratorIPSt5tupleIJfiiEESt6vectorIS3_SaIS3_EEEEiS3_St7greaterIS3_EEvT_T0_SC_T1_T2_ (10)
                0.08    0.00 1967195/2031195     _ZN3RKDI24Division_Euclidean_spaceIfEE2nnEjRKSt6vectorIfSaIfEERKfRS3_ISt4pairIfiESaISB_EERiiPjRSt14priority_queueISt5tupleIJfiiEES3_ISJ_SaISJ_EESt7greaterISJ_EES9_RKjS7_S7_i [2]
[9]      0.1    0.09    0.00 2031195         void std::__push_heap<__gnu_cxx::__normal_iterator<std::tuple<float, int, int>*, std::vector<std::tuple<float, int, int>, std::allocator<std::tuple<float, int, int> > > >, int, std::tuple<float, int, int>, std::greater<std::tuple<float, int, int> > >(__gnu_cxx::__normal_iterator<std::tuple<float, int, int>*, std::vector<std::tuple<float, int, int>, std::allocator<std::tuple<float, int, int> > > >, int, int, std::tuple<float, int, int>, std::greater<std::tuple<float, int, int> >) [9]
-----------------------------------------------
                0.01    0.00   12000/12000       _ZN3RKDI24Division_Euclidean_spaceIfEE2nnEjRKSt6vectorIfSaIfEERKfRS3_ISt4pairIfiESaISB_EERiiPjRSt14priority_queueISt5tupleIJfiiEES3_ISJ_SaISJ_EESt7greaterISJ_EES9_RKjS7_S7_i [2]
[11]     0.0    0.01    0.00   12000         _ZNSt6vectorISt5tupleIJfiiEESaIS1_EE19_M_emplace_back_auxIJS1_EEEvDpOT_ [11]

I am pushing way more items than I pop. In the question I linked, I show the code related to my heap. We can assume that the heap never gets empty. I am not sure about the distribution of the values I push.
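
As an aside on resolving the mangled names: c++filt does it at the command line, and the same can be done programmatically with abi::__cxa_demangle. A minimal sketch, using one of the mangled symbols from the profile above (GCC/Clang only):

#include <cxxabi.h>   // abi::__cxa_demangle
#include <cstdio>
#include <cstdlib>

int main() {
  // The vector reallocation symbol from entry [11] of the gprof output.
  const char* mangled =
      "_ZNSt6vectorISt5tupleIJfiiEESaIS1_EE19_M_emplace_back_auxIJS1_EEEvDpOT_";
  int status = 0;
  // __cxa_demangle returns a malloc'd, human-readable name on success (status == 0).
  char* readable = abi::__cxa_demangle(mangled, NULL, NULL, &status);
  if (status == 0 && readable != NULL)
    std::printf("%s\n", readable);
  std::free(readable);  // the caller owns the returned buffer
  return 0;
}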

gsamaras
  • How you're using the heap and how big it gets will dramatically affect the relative performance of various kinds of heaps (as will the quality of the implementation and the particulars of your CPU). Are you doing only pushes and pops? How large does it get? You're not emptying it? Are all of the things you put on the heap uniformly random, or do they show some sort of trend, such as slowly increasing or decreasing on average over time? –  Apr 28 '15 at 09:43
  • Are you sure that the heap is actually your bottleneck and not, for instance, deep copying the objects with std::make_tuple? – cerkiewny Apr 28 '15 at 13:48
  • No, I am not sure; that's why I posted what gprof gave me, @cerkiewny. – gsamaras Apr 28 '15 at 13:49
  • Because it looks like the push_heap took 0.08 s while the Division_Euclidean_space call took 105 s, it must be the body of that function that is time-consuming, not the push_heap. – cerkiewny Apr 28 '15 at 13:51
  • @cerkiewny the push_heap is inside the body of this function. – gsamaras Apr 28 '15 at 13:52
  • Yes, but usually the output of a profiler shows the cumulative time of each call, so it looks like the total time spent in the body of your function is 105 s, out of which only 0.08 s is inside the push_heap, leaving the rest in parts of your algorithm that are unknown to us. In other words, for the stack trace where the push_heap is encountered, you have something far more expensive than the push_heap itself. – cerkiewny Apr 28 '15 at 13:56
  • I would really like to know how to produce a better output, @cerkiewny! – gsamaras Apr 28 '15 at 13:56
  • A flat profile would be much more valuable for you: https://sourceware.org/binutils/docs/gprof/Flat-Profile.html#Flat-Profile (a flat-profile invocation is sketched after this thread). – cerkiewny Apr 28 '15 at 13:58
  • You mean profiling the whole code? This has been done, as one can see in my linked question! – gsamaras Apr 28 '15 at 14:01
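
For reference, a flat profile (as suggested in the last comments) can be produced with something like the line below, assuming the same geraf executable and gmon.out file as in the call-graph command above; flat_analysis.txt is just a placeholder name:

gprof -p geraf gmon.out > flat_analysis.txt

The flat profile reports the self time of every function in the program, which makes it easier to see whether the time really goes into push_heap or into the rest of nn.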

1 Answer


See the heap comparison on Wikipedia:

http://en.wikipedia.org/wiki/Heap_(data_structure)

In theory, the fastest you can get is a Fibonacci heap, which has O(1) insertion. AFAIK the STL priority_queue is just a binary heap. You could improve on speed by using the Boost implementation of the Fibonacci heap (boost::heap::fibonacci_heap). Also, here is a question from SO showing the usage of the Boost heap.

It is worth mentioning that the data shown on Wikipedia is a theoretical comparison of the heaps. The STL implementation is a binary heap because it is usually much faster than a Fibonacci heap for heaps of small size, thanks to better constant factors and cache behaviour. A nice question summarising the Fibonacci heap information is here.
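
For reference, a minimal sketch of what swapping the question's Min_heap for boost::heap::fibonacci_heap could look like; the Min_fib_heap name is made up here, and this assumes Boost.Heap is installed:

#include <boost/heap/fibonacci_heap.hpp>
#include <tuple>
#include <functional>
#include <cstdio>

// Like std::priority_queue, Boost heaps are max-heaps with respect to their
// comparator, so std::greater turns this into a min-heap over the tuples.
typedef boost::heap::fibonacci_heap<
    std::tuple<float, int, int>,
    boost::heap::compare<std::greater<std::tuple<float, int, int> > > > Min_fib_heap;

int main() {
  Min_fib_heap branch;
  branch.push(std::make_tuple(0.5f, 1, 2));   // O(1) insertion
  branch.push(std::make_tuple(0.1f, 3, 4));
  std::tuple<float, int, int> nearest = branch.top();  // smallest distance on top
  branch.pop();                                         // amortized O(log n)
  std::printf("%f\n", std::get<0>(nearest));  // prints the smallest distance (0.1)
  return 0;
}

Whether this actually beats the binary heap behind std::priority_queue for this push-heavy workload would still have to be measured, for the reasons about small heaps given above.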

cerkiewny
  • Good point. Putting Boost in my project is something I do not like. As a matter of fact, I was forced to use it for version 1.0 and was way too happy to get away from it in the next versions of my project. – gsamaras Apr 28 '15 at 13:28
  • You can always try to find another implementation of the Fibonacci heap, implement it yourself, or strip it out of Boost. – cerkiewny Apr 28 '15 at 13:33
  • If I knew where to find, I wouldn't ask. :) Thanks! – gsamaras Apr 28 '15 at 13:33
  • There is one example at https://github.com/robinmessage/fibonacci; it comes with tests, so it has a reasonable chance of working. – cerkiewny Apr 28 '15 at 13:34
  • Nice, +1. Maybe I will accept your answer too if nothing better comes up! – gsamaras Apr 28 '15 at 13:38
  • However, the link you posted says: "This implementation should show how Fibonacci heaps work; it is not intended to be highly performant." – gsamaras Apr 28 '15 at 13:50