0

Background: I'm new to C++. I have a std::map and am trying to search for elements by key.

Problem: Performance. The map::find() function slows down when the map gets big.

Preferred approach: I often know roughly where in the map the element should be; I can provide a [first,last) range to search in. This range is always small w.r.t. the number of elements in the map. I'm interested in writing a short binary search utility function with boundary hinting.

Attempt: I stole the below function from https://en.cppreference.com/w/cpp/algorithm/lower_bound and did some rough benchmarks. This function seems to be much slower than map::find() for maps large and small, regardless of the size or position of the range hint provided. I replaced the comparison statements (it->first < value) with a comparison of random ints and the slowdown appeared to resolve, so I think the slowdown may be caused by the dereferencing of it->first.

Question: Is the dereferencing the issue? Or is there some kind of unnecessary copy/move action going on? I think I remember reading that maps don't store their element nodes sequentially in memory, so am I just getting a bunch of cache misses? What is the likely cause of the slowdown, and how would I go about fixing it?

/* @param first     Iterator pointing to the first element of the map to search. 
 * @param distance  Number of map elements in the range to search.
 * @param key       Map key to search for. NOTE: Type validation is not a concern just yet.
 */    
template<class ForwardIt, class T>
ForwardIt binary_search_map (ForwardIt& first, const int distance, const T& key) {
    ForwardIt it = first;
    typename std::iterator_traits<ForwardIt>::difference_type count, step;
    count = distance;  
          
    while (count > 0) {
        it = first;
        step = count/2;
        std::advance(it, step);
        if (it->first < value) {
            first = ++it;
            count -= step + 1;
        }
        else if (it->first > value)
            count = step;
        else {
            first = it; 
            break; 
        }
    }
    return first;
}
C. Ventin
  • 135
  • 7
  • I haven't looked at your problem for very long but the obvious thing I see is that `std::advance` call you make has to traverse the map. This is quite different from how the map is designed, where it's mostly balanced and set up for optimal binary search. So your "guess" has to be much better than O(logN). It would seem that it's not. – paddy Aug 07 '21 at 04:25
  • I replaced the compare expressions with a compare of random ints, and the slowdown resolved. Wouldn't std::advance have continued to cause slowdown? – C. Ventin Aug 07 '21 at 04:28
  • 1
    Look at how your standard library's `find` is implemented to see how they are doing it. (C++ is notoriously hard to micro-benchmark, "rough benchmarks" are as likely to be irrelevant as not.) – Mat Aug 07 '21 at 04:33
  • if you need element by key you do this thing `std::map c; c["sekg"] = 5; std::cout << c["sekg"];` No? – Алексей Неудачин Aug 07 '21 at 04:44
  • Алексей Неудачин, the issue being, what if the map is very large and you need to find a lot of elements? Range hinting would greatly reduce search time, even vs. O(log N) complexity in the map size. – C. Ventin Aug 07 '21 at 04:47
  • @Mat I checked out the libstdc++ implementation headers on stl_map and stl_tree. map::find() calls tree::find(), which uses tree::lower_bound(). Unfortunately, STL does not require type map to provide external access to the tree, so I'm missing direct access to the nodes, I think. Let me know if you know a way. Or, perhaps . . . should I drop std::map and just make my own RB tree? – C. Ventin Aug 07 '21 at 04:57
  • 1
    Before making your own, check if there isn't some collection in Boost that's more appropriate for your use-case – Mat Aug 07 '21 at 05:02
  • 1
    Tree iterators are bidirectional iterators. `std::advance(it, count)` on a bidirectional iterator calls `++it` or `--it` in a loop `count` times. `operator++` in a tree needs to find the next node, which is an O(log n) operation, where `n` is the size of the container. – n. m. could be an AI Aug 07 '21 at 06:43
  • 2
    Binary search only makes sense, if you've got a random access iterator. `std::map`does not provide a random access iterator, so every time you want to get an iterator with (approximately) equal distance to 2 other iterators, the complexity of getting there is`Ω(n)`which means doing the binary search has at least the same complexity; assuming your comparison operation is not expensive the algorithm will be simpler and probably at least as efficient; to improve performance for strings that are lexicographically close starting the comparison at the end of the string may improve performance though – fabian Aug 07 '21 at 07:26

1 Answers1

3

There is a reason that std::map::find() exists. The implementation already does a binary search, as the std::map has a balanced binary tree as implementation.

Your implementation of binary search is much slower because you can't take advantage of that binary tree. If you want to take the middle of the map, you start with std::advance it takes the first node (which is at the leaf of the tree) and navigates through several pointers towards what you consider to be the middle. Afterwards, you again need to go from one of these leaf nodes to the next. Again following a lot of pointers.

The result: next to a lot more looping, you get a lot of cache misses, especially when the map is large.

If you want to improve the lookups in your map, I would recommend using a different structure. When ordering ain't important, you could use std::unordered_map. When order is important, you could use a sorted std::vector<std::pair<Key, Value>>. In case you have boost available, this already exists in a class called boost::container::flat_map.

JVApen
  • 11,008
  • 5
  • 31
  • 67
  • I'm looking into using a different data structure based on some comments on the question. It looks like boost::container::flat_map is quite a bit slower than map, which mostly defeats the purpose. See very recent benchmarking someone has done, here: https://stackoverflow.com/questions/21166675/boostflat-map-and-its-performance-compared-to-map-and-unordered-map# Flat_map iteration is faster, as you suggest, presumably because it provides a random access iterator, but search is still slower than std::map. For the other option, how do you efficiently insert-sort in a vector of pairs? – C. Ventin Aug 07 '21 at 17:18
  • 1
    Depends on your use cases. I usually find myself creating them once and querying a lot. Barely updating them, so if updating is really important you might want to delay sorting till you query? – JVApen Aug 07 '21 at 17:48