
I have a single line of code that consumes 25%-30% of the runtime of my application. It is a less-than comparator for an std::set (the set is implemented as a red-black tree). It is called about 180 million times within 28 seconds.

struct Entry {
    const float _cost;
    const long _id;

    // some other vars

    Entry(float cost, long id) : _cost(cost), _id(id) {
    }
};



template<class T>
struct lt_entry: public binary_function <T, T, bool>
{
    bool operator()(const T &l, const T &r) const
    {
        // Most readable shape
        if(l._cost != r._cost) {
            return r._cost < l._cost;
        } else {
            return l._id < r._id;
        }
    }
};

The entries should be sorted by cost and, if the cost is the same, by their id. I have many insertions for each extraction of the minimum. I thought about using Fibonacci heaps, but I have been told that they are theoretically nice but suffer from high constants and are pretty complicated to implement. And since insert is in O(log(n)), the runtime increase is nearly constant with large n. So I think it's okay to stick to the set.
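In case it helps, the usage pattern boils down to something like the following simplified sketch (the real Entry has more fields, and the inserts come out of the surrounding graph code):

#include <set>

// Simplified sketch of the usage pattern; full definitions are above.
struct Entry {
    float _cost;
    long  _id;
};

template<class T>
struct lt_entry {
    bool operator()(const T &l, const T &r) const {
        if (l._cost != r._cost) return r._cost < l._cost;
        return l._id < r._id;
    }
};

int main() {
    std::set<Entry, lt_entry<Entry> > queue;

    Entry e = { 1.0f, 7 };
    queue.insert(e);                  // many of these, each O(log n) comparator calls

    if (!queue.empty()) {
        Entry best = *queue.begin();  // the entry lt_entry orders first
        queue.erase(queue.begin());   // ...and occasionally one extraction
        (void)best;
    }
    return 0;
}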

To improve performance I tried to express it in different shapes:

return l._cost < r._cost || (!(r._cost < l._cost) && l._id < r._id);

return l._cost < r._cost || (l._cost == r._cost && l._id < r._id);

Even this:

typedef union {
    float _f;
    int _i;
} flint;

//...

flint diff;
diff._f = (l._cost - r._cost);
// sign bit of the float difference set (and difference non-zero) => l._cost < r._cost;
// fall back to the id comparison only when the costs are equal
return (diff._i && (diff._i >> 31)) || (diff._i == 0 && l._id < r._id);

But the compiler seems to be smart enough already, because I haven't been able to improve the runtime.

I also thought about SSE, but this problem doesn't really lend itself to SSE...

The assembly looks somewhat like this:

movss  (%rbx),%xmm1
mov    $0x1,%r8d
movss  0x20(%rdx),%xmm0
ucomiss %xmm1,%xmm0
ja     0x410600 <_ZNSt8_Rb_tree[..]+96>
ucomiss %xmm0,%xmm1
jp     0x4105fd <_ZNSt8_Rb_[..]_+93>
jne    0x4105fd <_ZNSt8_Rb_[..]_+93>
mov    0x28(%rdx),%rax
cmp    %rax,0x8(%rbx)
jb     0x410600 <_ZNSt8_Rb_[..]_+96>
xor    %r8d,%r8d

I have a tiny bit of experience with assembly language, but not much.

I thought this would be the best (only?) place to squeeze out some performance, but is it really worth the effort? Can you see any shortcuts that could save a few cycles?

The platform the code will run on is Ubuntu 12 with gcc 4.6 (-std=c++0x) on a many-core Intel machine. The only libraries available are Boost, OpenMP and TBB. The 30-second benchmark was performed on my 4-year-old laptop (Core 2 Duo).

I am really stuck on this one; it seems so simple, but takes that much time. I have been racking my brain for days thinking about how I could improve this line...

Can you give me a suggestion for improving this part, or is it already as good as it gets?

EDIT 1: After using Jerry's suggestion I achieved a speed-up of ~4.5 seconds. EDIT 2: After trying Boost Fibonacci heaps, the comparison went down to 174 million calls to the less-than function.

Heye
    Comparing two numbers is the bottleneck for your application? Wow... You must be doing almost nothing in the other parts of the program. – Seth Carnegie Dec 12 '12 at 03:49
  • Have you considered a parallel sort algorithm? http://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance – Carl Dec 12 '12 at 03:52
  • If you're using this inside an `std::set`, you're not gonna get much out of trying to vectorize it with SSE. If it were integers, you might be able to get away with some bitwise hacks. But other than that, I don't see any room for micro-optimizations since `std::set` (which is essentially an untouchable black-box) is gonna get in the way. – Mysticial Dec 12 '12 at 03:54
  • You should consider the likely possibility that your profiling software is lying, and the bottleneck is a few lines up or down from this operation. – Hot Licks Dec 12 '12 at 03:54
  • Could the root of the problem be in comparing floating-point numbers? Maybe use some EPS comparison? Something like ABS(l.cost-r.cost) < EPS (EPS = 1e-6) meaning l.cost and r.cost are equal – imslavko Dec 12 '12 at 04:01
  • BTW, I think 52s for 1.8*10^8 comparisons is a fair number. An RB-tree has a big hidden constant, and a computer with a P4 processor can run approx. 10^7 basic operations per second. – imslavko Dec 12 '12 at 04:03
  • What exactly are you hitting on? I.e. how did you profile it? Is it branch misprediction, cache misses? Unaligned access? I'd recommend you figure that out first using something like VTune or perf tools. Then think about how to fix it. It could be worth restructuring to make it cache friendly, for example, or even making it fit in L1 (or L2). So far you are wasting memory on padding for sure. Bear in mind that you are also likely to see side effects due to other code, and even if the CPU sucks here, it doesn't necessarily mean this code is at fault. – Dec 12 '12 at 04:04
  • The standard way to do the comparison would be: `return l._cost < r._cost || (l._cost == r._cost && l._id < r._id);` – Martin York Dec 12 '12 at 04:06
  • Did you turn on optimizations? `-O3` – Martin York Dec 12 '12 at 04:08
  • I don't know about speed, but for readability, I like [std::tie](http://en.cppreference.com/w/cpp/utility/tuple/tie): `return std::tie(l._cost, l._id) < std::tie(r._cost, r._id);` – Robᵩ Dec 12 '12 at 05:00
  • @Hot Licks: I am profiling using gperftools, I don't know how exact they are, but so far they seemed precise enough. – Heye Dec 12 '12 at 09:11
  • First off, most profiling tools divide the code into "windows", which may be anywhere from a few hundred bytes in size to maybe 4. The tool will only be as accurate as the window size, so use the smallest size (focused on the areas of code that seems to be "hottest") as you can. Second, the tools generally rely on getting interrupts so they can sample the instruction pointer, but particular code sequences may "hold off" interrupts, causing the following code to be "blamed". And there are other effects due to cache misses, task switching, etc. – Hot Licks Dec 12 '12 at 12:37
  • @Seth Carnegie: I am using the set for a Dijkstra. So basically removing the best node and inserting all nodes this node links to (so log(n) calls to the less-than operator per inserted node). So yes, it's mostly this line the algorithm relies on... – Heye Dec 12 '12 at 16:37
  • @Heye, ok, in that case you want a priority queue of some kind, depending on the size of your problem. The standard vector-based solution is attractive because vectors do not require allocation for each element, which makes them a lot lighter weight than tree-based solutions. However, for some size of problem (really big), the fibonacci queue's lower complexity will start being a win. (Afaics, it also does one allocation for every element.) – rici Dec 12 '12 at 17:09
  • For anyone still following along with this conversation, using a priority queue instead of a set will allow using a simple comparison of _cost, because the priority queue doesn't insist on uniqueness, unlike a set. So no additional trickery is needed for optimizing comparison. – rici Dec 12 '12 at 17:36

5 Answers


I have a hard time believing that:

a) The comparison function runs 180 million times in 30 seconds

and

b) The comparison function uses 25% of the cpu time

are both true. Even a Core 2 Duo should easily be able to run 180 million comparisons in less than a second (after all, the claim is that it can do something like 12,000 MIPS, if that actually means anything). So I'm inclined to believe that there is something else being lumped in with the comparison by the profiling software. (Allocating memory for new elements, for example.)

However, you should at least consider the possibility that a std::set is not the data structure you're looking for. If you do millions of inserts before you actually need the sorted values (or maximum value, even), then you may well be better off putting the values into a vector, which is a much cheaper data structure both in time and space, and sorting it on demand.
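In code, that alternative looks something like this (a sketch with the question's field names; LtEntry is just a non-template copy of lt_entry, and note the members can't stay const if the vector is to be sorted in place):

#include <algorithm>
#include <vector>

// Buffer the entries cheaply, then sort once when the order is actually needed.
struct Entry {
    float _cost;
    long  _id;
};

struct LtEntry {
    bool operator()(const Entry &l, const Entry &r) const {
        if (l._cost != r._cost) return r._cost < l._cost;
        return l._id < r._id;
    }
};

int main() {
    std::vector<Entry> entries;
    for (long i = 0; i < 1000; ++i) {
        Entry e = { float(1000 - i), i };
        entries.push_back(e);            // no per-element node allocation
    }
    std::sort(entries.begin(), entries.end(), LtEntry());
    // entries.front() is now the element the comparator orders first
    return 0;
}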

If you actually need the set because you're worried about collisions, then you might consider an unordered_set instead, which is slightly cheaper but not as cheap as a vector. (Precisely because vectors cannot guarantee you uniqueness.) But honestly, looking at that structure definition, I have a hard time believing that uniqueness is important to you.

"Benchmark"

On my little Core i5 laptop, which I suppose is not in the same league as OP's machine, I ran a few tests inserting 10 million random unique Entry values (with just the two comparison fields) into a std::set and into a std::vector. At the end of this, I sort the vector.

I did this twice; once with a random generator that produces probably unique costs, and once with a generator which produces exactly two different costs (which should make the compare slower). Ten million inserts result in slightly more comparisons than reported by OP.

              unique cost         discrete cost
           compares     time    compares     time
set       243002508    14.7s   241042920    15.6s   
vector    301036818     2.0s   302225452     2.3s

In an attempt to further isolate the comparison times, I redid the vector benchmarks using both std::sort and std::partial_sort, using 10 elements (essentially a selection of top-10) and 10% of the elements (that is, one million). The results of the larger partial_sort surprised me -- who would have thought that sorting 10% of a vector would be slower than sorting all of it -- but they show that algorithm costs are a lot more significant than comparison costs:

                     unique cost         discrete cost
                  compares     time    compares     time
partial sort 10   10000598     0.6s    10000619     1.1s
partial sort 1M   77517081     2.3s    77567396     2.7s
full sort        301036818     2.0s   302225452     2.3s   
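For reference, the partial-sort rows were produced with a call shaped roughly like this (a sketch reusing the Entry and LtEntry types from the snippet above; the helper name top_k is made up, and this is not the actual benchmark code):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sort only the k elements that come first under the comparator;
// the rest of the vector is left in unspecified order.
void top_k(std::vector<Entry> &entries, std::size_t k) {
    if (k > entries.size()) k = entries.size();
    std::partial_sort(entries.begin(), entries.begin() + k, entries.end(), LtEntry());
}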

Conclusion: The longer compare time is visible, but container manipulation dominates. The total cost of ten million set inserts is certainly visible in a total of 52 seconds of compute time. The total cost of ten million vector inserts is quite a bit less noticeable.

Small note, for what it's worth:

The one thing I got from that bit of assembly code is that you're not saving anything by making the cost a float. Because of alignment padding, the layout effectively spends eight bytes on the float anyway, so you're not saving any memory, and your CPU does not do a single float comparison any faster than a single double comparison. Just sayin' (i.e., beware of premature optimization).
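A quick way to check that layout claim on an LP64 target like x86-64 Linux (a sketch; the question's extra members would change the exact numbers):

#include <cstddef>   // offsetof
#include <cstdio>

struct Entry {
    float _cost;   // 4 bytes of data...
    long  _id;     // ...but long wants 8-byte alignment on LP64
};

int main() {
    // Expected on x86-64 Linux: sizeof(Entry) == 16 and _id at offset 8,
    // i.e. 4 bytes of padding sit right after the float.
    std::printf("sizeof(Entry)        = %zu\n", sizeof(Entry));
    std::printf("offsetof(Entry, _id) = %zu\n", offsetof(Entry, _id));
    return 0;
}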

Downvoter, care to explain?

rici
  • Thanks for the benchmark. Actually I messed up a little: the platform where the code will have to perform is the many-core Intel machine (I think it's 40 cores), but where I measured the calls was on my 4yr old laptop (Core 2 Duo). And I accidentally took the wrong time, it was 30 seconds, not 52 seconds. – Heye Dec 12 '12 at 12:07
  • @Heye, I don't think that makes much difference. Maybe my machine is faster than yours, but it's not an order of magnitude faster. If I can create a 10 million element vector and then sort it (moving all of those elements around) in two seconds, the comparisons are not taking even a second. Say they take two seconds on your machine. Two out of 30 is still a lot less than 25%. But the cost of using `set` remains. – rici Dec 12 '12 at 14:14
  • Try adding a std::string with "foo bar" to the structs you are sorting; I did 4.8 billion calls of the lt function. Without the string 1 second, with the string 5.5 seconds. Using @Jerry Coffin's idea it went to 1.5 seconds. Sure, the time the comparison takes is quite small, but I managed to squeeze out about a 15% improvement. Though 4.8 billion is much more than 180 million, the benchmark might be optimized in a different way and is therefore faster... – Heye Dec 12 '12 at 14:49
  • @Heye, Jerry's layout saves you eight bytes on each object, which improves locality, but it means that your ids shrink from 64 bits to 32 bits. Is that acceptable? If so, you should definitely go for it, but I'd try it without the aliasing hack and see how that goes. I presume that you mean by "add a std::string" that the string is only present in the struct, not participating in the search; even so, the struct will grow by some amount, and that could cause any number of cache effects depending on how you do the comparisons. I chose to do it with real containers because that's more realistic. – rici Dec 12 '12 at 15:11
  • @heye, and also if you can use a vector (or even an unordered_map), you might see improvements on the order of 50%. My performance advice is always "first look at algorithms; then at data structures; only then try to micromanage optimizations." YMMV, of course. – rici Dec 12 '12 at 15:13
  • OK, I tried various arrangements. You can achieve most of the benefits of Jerry's solution by just changing _id from a long to an int32_t (as I said above, this depends on your being happy with 32 bits of id); there is a significant reduction in memory usage and everything gets a little faster. Using Jerry's hack as well helps in the case of the discrete distribution, not much in the case of the continuous distribution, but the benefit for a std::set is 0.5 sec out of about 14, or 3%. Adding the std::string also increases memory usage and time, particularly noticeable for vectors. – rici Dec 12 '12 at 16:11
  • I only chose the long because I used to simply take the pointer to an edge in the graph I'm traversing as its id. I now use an int32_t id for each edge. I think the number of edges won't be more than 10 or 20 million. I am trying the different heaps offered by Boost, but haven't seen a big performance increase with Fibonacci heaps or binomial heaps so far. The other heaps are somehow causing compiling issues... – Heye Dec 12 '12 at 17:19
  • @Heye: if you use a priority queue rather than a set, then it's OK if your comparison function effectively returns equality for equal elements. That will reduce your comparison function to a simple comparison on _cost, and you can stop mucking about with aliasing etc. I'd try to find a vector-based priority queue if I were you, because I believe node allocation costs and memory dispersion are what are killing your performance. – rici Dec 12 '12 at 17:33

An easy solution is to precompute a sort identifier composed of the cost as the most significant part and the id as the rest.

E.g.,

struct Entry
{
    double cost_;
    long id_;
    long long sortingId_;

  // some other vars

    Entry( double cost, long id )
        : cost_( cost ), id_( id ), sortingId_( 1e9*100*cost + id )
    {}
};

Adjust the sortingId_ computation based on what you can assume about the value ranges.

Then just sort on sortingId_.


Or, as a variation of the same idea, if you can't make suitable assumptions about the data, consider preparing the data especially for memcmp.


For a higher level solution, remember that std::set::insert has an overload with a hint argument. If your data is already in near sorted order, that might seriously reduce the number of calls to your comparator function.
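For example, hinted insertion looks something like this (a hypothetical loop with a made-up helper name; it assumes the incoming data is already roughly in the set's sort order, so the end() hint is usually right and the insert is amortized O(1)):

#include <set>
#include <vector>

// Insert a batch that is assumed to arrive (almost) in the set's sort order.
template<class Set, class Vec>
void bulk_insert_sorted(Set &s, const Vec &incoming)
{
    for (typename Vec::const_iterator it = incoming.begin(); it != incoming.end(); ++it) {
        s.insert(s.end(), *it);   // hinted insert: O(1) amortized when the hint is correct
    }
}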


And you might consider whether a std::unordered_set would be sufficient, i.e. whether you actually need to list the data in sorted order, or whether the sorting is just an internal artifact of std::set element insertion.


Finally, for other readers (the OP has made clear that he's aware of this), remember to MEASURE.

Cheers and hth. - Alf

Let me preface this with the fact that what I'm going to outline here is fragile and not entirely portable -- but under the right circumstances (which are pretty much what you've specified) I'm reasonably certain that it should work correctly.

One point it depends upon is the fact that IEEE floating point numbers are carefully designed so that if you treat their bit pattern as an integer, they'll still sort into the correct order (modulo a few things like NaNs, for which there really is no "correct order").

To make use of that, what we do is pack the Entry so there's no padding between the two pieces that make up our key. Then we ensure the structure as a whole is aligned to an 8-byte boundary. I've also changed the _id to int32_t to ensure that it stays 32 bits, even on a 64-bit system/compiler (which will almost certainly produce the best code for this comparison).

Then, we cast the address of the structure so we can view the floating point number and the integer together as a single 64-bit integer. Since you're using a little-endian processor, to support that we need to put the less significant part (the id) first, and the more significant part (the cost) second, so when we treat them as a 64-bit integer, the floating point part will become the most significant bits, and the integer part the less significant bits:

struct __attribute__ ((__packed__)) __attribute__((aligned(8))) Entry {
  // Do *not* reorder the following two fields or comparison will break.
  const int32_t _id;
  const float _cost;

  // some other vars

    Entry(long id, float cost) : _cost(cost), _id(id) {} 
};

Then we have our ugly little comparison function:

bool operator<(Entry const &a, Entry const &b) { 
   return *(int64_t const *)&a < *(int64_t const *)&b;
}

Once we've defined the struct correctly, the comparison becomes fairly straightforward: just take the first 64 bits of each struct, and compare them as if they were 64-bit integers.

Finally a bit of test code to give at least a little assurance that it works correctly for some values:

int main() { 
    Entry a(1236, 1.234f), b(1234, 1.235f), c(1235, 1.235f);

    std::cout << std::boolalpha;

    std::cout << (b<a) << "\n";
    std::cout << (a<b) << "\n";
    std::cout << (b<c) << "\n";
    std::cout << (c<b) << "\n";
    return 0;
}

At least for me, that produces the expected results:

false
true
true
false

Now, some of the possible problems: if the two items get rearranged either between themselves, or any other part of the struct gets put before or between them, comparison will definitely break. Second, we're completely dependent on the sizes of the items remaining 32 bits apiece, so when they're concatenated they'll be 64 bits. Third, if somebody removes the __packed__ attribute from the struct definition, we could end up with padding between _id and _cost, again breaking the comparison. Likewise, if somebody removes the aligned(8), the code may lose some speed, because it's trying to load 8-byte quantities that aren't aligned to 8-byte boundaries (and on another processor, this might fail completely). [Edit: Oops. @rici reminded me of something I intended to list here, but forgot: this only works correctly when both the _id and cost are positive. If _cost is negative, comparisons will be messed up by the fact that IEEE floating point uses a sign-magnitude representation. If an _id is negative, its sign bit will be treated just like a normal bit in the middle of a number, so a negative _id will show up as larger than a positive _id.]

To summarize: this is fragile. No question at all about that. Nonetheless, it should be pretty fast -- especially if you're using a 64-bit compiler, in which case I'd expect the comparison to come out to two loads and one comparison. To make a long story short, you're at the point that you probably can't make the comparison itself any faster at all -- all you can do is try to do more in parallel, optimize memory usage patterns, etc.

Jerry Coffin
  • One question: why do we need to align the struct to an 8-byte boundary? Isn't ensuring that there is no padding between the two pieces enough? – Nawaz Dec 12 '12 at 05:57
  • `*(int64_t const *)&a` causes undefined behavior (breaks strict aliasing rule). – Andrey Vihrov Dec 12 '12 at 09:21
  • Wow, great idea! I managed to achieve a speedup of about 15%. I am using an extra struct Cmp that contains only the id and the cost, and I put that as the first field of the Entry struct, because the compiler warns about the packing when I have a std::vector or some other containers in the Entry struct. But this way it performs quite well. :) – Heye Dec 12 '12 at 11:59
  • @Nawaz: On an Intel processor, the 8-byte alignment isn't strictly necessary (but may improve speed). On some processors (RISCs, mostly) trying to load an 8-byte item that isn't aligned to an 8-byte boundary can/will give an exception. – Jerry Coffin Dec 12 '12 at 12:59
  • @AndreyVihrov: What strict aliasing rules do you think it violates? [Hint: This is tagged C++, *not* C99]. The reality is that the behavior isn't defined by the standard -- but the lack of definition has little or nothing to do with strict aliasing, as long as you're compiling as C++. – Jerry Coffin Dec 12 '12 at 13:06
  • Jerry, that only works if the floating point number is non-negative, since IEEE-754 is a sign/magnitude representation. I presume that's not a problem in this case (`cost` sounds to me like a non-negative quantity), but it surely needs to be noted. – rici Dec 12 '12 at 15:22
  • @JerryCoffin: See the C++98 standard, clause 3.10.15 — it specifies that the above code is UB. – Andrey Vihrov Feb 07 '13 at 09:54
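For readers bothered by the aliasing concern raised in the comments above, essentially the same comparison can be written with memcpy instead of a pointer cast (a sketch, not part of the original answer, with a made-up helper name; it assumes the packed layout above, i.e. that _id and _cost are the first eight bytes of Entry). GCC typically compiles a fixed 8-byte memcpy down to a single 64-bit load, so the generated code should be comparable:

#include <cstdint>
#include <cstring>

// Assumes the packed Entry from the answer: int32_t _id immediately followed
// by float _cost, on a little-endian machine, with both values non-negative.
inline bool less_packed(const Entry &a, const Entry &b)
{
    std::int64_t ka, kb;
    std::memcpy(&ka, &a, sizeof ka);   // copy the first 8 bytes of each object
    std::memcpy(&kb, &b, sizeof kb);
    return ka < kb;
}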

I have many insertions for each extraction of the minimum. I thought about using Fibonacci heaps, but I have been told that they are theoretically nice but suffer from high constants and are pretty complicated to implement. And since insert is in O(log(n)), the runtime increase is nearly constant with large n. So I think it's okay to stick to the set.

This sounds to me like a typical priority-queue application. You say you considered using a Fibonacci heap, so I guess such a priority-queue implementation would be sufficient for your needs (pushing elements, and extracting the min element one at a time). Before you go out of your way and obsess over optimizing one or two clock cycles out of that comparison function, I would suggest that you try a few off-the-shelf priority-queue implementations, like std::priority_queue, boost::d_ary_heap (or boost::d_ary_heap_indirect for a mutable priority queue), or any other Boost heap structure.

I encountered a similar situation before: I was using a std::set in place of a priority queue in an A*-like algorithm (and I also tried a sorted std::vector with std::inplace_merge for insertions), and switching to std::priority_queue was a huge boost in performance; switching later to boost::d_ary_heap_indirect went the extra mile. I recommend that you at least give that a try if you haven't already.
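As a starting point, a minimal std::priority_queue version might look like this (a sketch with the question's field names; the comparator name is mine, and note that the heap does not need the id tie-break at all, since it never requires uniqueness):

#include <queue>
#include <vector>

struct Entry {
    float _cost;
    long  _id;
};

// std::priority_queue pops the *largest* element under its comparator,
// so compare with '>' to pop the smallest cost first.
struct CostGreater {
    bool operator()(const Entry &a, const Entry &b) const {
        return a._cost > b._cost;
    }
};

int main() {
    std::priority_queue<Entry, std::vector<Entry>, CostGreater> pq;

    Entry e = { 1.5f, 42 };
    pq.push(e);               // O(log n), no per-node allocation
    Entry best = pq.top();    // the entry with the smallest cost
    pq.pop();
    (void)best;
    return 0;
}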

Mikael Persson
  • I am trying the d_ary_heap right now, but I have some compiling problems with deleted copy constructor: – Heye Dec 12 '12 at 14:14
  • These are my constructors: `Entry(Entry &&e) : _cmp(e._cmp) {}` and `Entry(Entry &e) : _cmp(e._cmp) {}` (shortened) where `_cmp` is the struct as suggested by @Jerry Coffin. `bh::d_ary_heap, bh::compare > >` and it gives me: "Entry& Entry::operator=(const Entry&)’ is implicitly declared as deleted because ‘Entry’ declares a move constructor or move assignment operator" on the heap.pop() operation. I can post an example code as soon as I have time. – Heye Dec 12 '12 at 14:36

I don't have an answer per se - just a couple of ideas:

  1. If you're using GCC, I'd run some benchmarks with libstdc++ parallel mode enabled (compile with -D_GLIBCXX_PARALLEL)
  2. Are you sure that you're not dealing with denormalized numbers for the cost component? (A quick check is sketched below.)
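For point 2, here is a rough sketch (helper names are made up) of how to check for denormals, and how to flush them to zero if they turn out to be the culprit; flushing is only acceptable if treating tiny costs as zero is fine for the algorithm:

#include <cmath>        // std::fpclassify (C++11); the C99 fpclassify macro works too
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// True if x is a denormal (subnormal) float.
inline bool is_denormal(float x)
{
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Tell the SSE unit to treat denormal inputs and results as zero.
inline void flush_denormals_to_zero()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}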
Carl