0

I am doing a problem in c++ that has to keep track of points that are visited in a traversal. The point is basically,

struct Point {
  int x;
  int y;
};

My first thought to solving something like this would be to use something like

std::set<Point> visited_points;

or maybe

std::map<Point, bool> visited_points;

However, I am a beginner in c++, and I realized you have to implement a Compare, which I didn't know how to do. When I asked, I was told said that using a map was "overkill" in a problem like this. He said the better solution was to do something like

std::vector<std::vector<bool>> visited_points;

He said std::map was not the best solution, since using a vector was faster.

I'm wondering why using a double vector is better in terms of style and performance. Is it because implementing a Compare is hard for a Point? A double vector feels hacky to me, and I also think it looks uglier than using a set or map. Is it really the best way to approach this problem, or is there a better solution I don't know about?

Blubber
  • 1,375
  • 1
  • 15
  • 32
  • 2
    "He said std::map was not the best solution, since using a vector was faster." - he (whoever "he" is) was probably wrong, depending on the number of possibly visited points. A set is probably a good solution. I actually have an old blog article on the subject: https://latedev.wordpress.com/2013/08/12/less-than-obvious/ –  Mar 08 '18 at 17:35
  • Possible duplicate of [In which scenario do I use a particular STL container?](https://stackoverflow.com/questions/471432/in-which-scenario-do-i-use-a-particular-stl-container) – Smit Ycyken Mar 08 '18 at 17:42
  • For a small number of points `map` or `set` might be overkill, but `vector` will not scale to large lists well. Sooner or later the speed-by-simplicity of the `vector` will lose out to the volume of the data that must be searched. – user4581301 Mar 08 '18 at 17:47
  • When the number of points is low it's likely the whole `vector` fits in CPU cached memory. So iterating is fast, Compare func is also in CPU mem. A `map` is sorted, which means delays due to internally sorting items. – Ripi2 Mar 08 '18 at 17:48
  • 1
    A vector of vectors isn't inherently bad. However, a vector bools you do need to be careful with - they don't function quite the same as vectors of other types (the differences might be beyond what you need to know at this stage, but bear it in mind.) – Rags Mar 08 '18 at 17:49
  • There is a easy solution to determine which is faster for the data set, create a timer and run it on a loop a 1,000 times, then average out the results to get an average run time, it will show you which approach is faster (usually). – Matthew Lagerwey Mar 08 '18 at 17:52
  • 3
    *"implementing a Compare is hard for a Point"*. Really not `struct LessPoint { bool operator()(const Point& lhs, const Point& rhs) const { return std::tie(lhs.x, lhs.y) < std::tie(rhs.x, rhs.y); } };`. and then `std::set` – Jarod42 Mar 08 '18 at 17:52
  • Other note on `vector` of `vector`. You have a `vector` that contains other `vector`s. A `vector` is basically a pointer to a datastore and some extra book-keeping (often a couple more pointers). Each `vector`'s datastore is unlikely to be anywhere near another's in storage and non-contiguous data generally results in a higher number of cache misses. – user4581301 Mar 08 '18 at 17:57
  • @NeilButterworth Thanks for the link to your blog post, I think that helped me understand how the map works. – Blubber Mar 08 '18 at 18:10
  • @user4581301: Scaling of the vector of vector isn't dependent on the number of points in the list, but on the canvas size (how many potential locations are there for points). – Ben Voigt Mar 08 '18 at 18:59

1 Answers1

3

If someone asks you, in abstract, "What is the best way of keeping track of objects I've visited?", then you would be forgiven for replying "Use an std::unordered_set<Object>" (usually called a hash table for languages other than C++). That's a nice simple answer and it is often correct if you don't know anything at all about the objects. After all, a hash lookup is (expected) O(1), and in practice is usually quite fast.

There are a few caveats, the biggest one being that you will need to be able to compute a hash for each object. The C++ standard library does not (yet) come with a framework for computing hashes of arbitrary objects, not even PODs, and rendering an object as a string in order to be able to take advantage of std::hash<std::basic_string> is usually way too much work (unless the object is already a string, of course).

If you can't figure out how to write a hash function for you object, you might then think about using an ordered associative container (aka a balanced BST). However, that is not a good idea. Not because it is difficult to write a comparison function. Writing comparison functions is usually trivial, particularly for PODs; you can leverage the fact that std::tuple implements a comparison function for every tuple whose element types are all comparable.

The real problem with ordered associative containers is that they are high overhead. Element access is slow: O(log n), not O(1), and the constant is not small either. And the bookkeeping data required to maintain the balanced tree is much larger than the two-pointer hash-table node (and even that is quite big for small objects). So ordered associative containers really only make sense if you need to be able to traverse them in order. Generally, "visited" maps don't need to be traversed at all -- they are just used for lookup.

Both ordered and unordered containers have another problem: the objects in the container are individual dynamic memory allocations (the API requires that references to the objects in the container must be stable), so over time the individual objects end up getting scattered across dynamic memory, leading to a lot of cache misses.

But, really, even before you start thinking about how easy (or difficult) it will be to hash your objects in order to keep them in a hash-set, you should think about the nature of the objects you are tracking. In particular, can they be easily indexed with a small(-ish) integer? If so, you could just use a vector of bits, one bit per possible object. That's an efficient representation, both for access speed (definitely O(1)) and for space, and it is optimal for memory caching.

If your objects are easily numbered then bit-vectors will be an attractive alternative. One bit per object is (literally) two orders of magnitude less space than a hash-map, so unless you expect your visited map to be extremely sparse (rarely the case in algorithms which need a visited map), it's going to be a big win.

In the case of your problem, which I gather has to do with keeping track of points visited in a rectangular array such as a gameboard or an image, it is clear that the bit vector approach is going to work out well. It's true that you require two levels of indexing (unless you reduce the two indices into a single integer, which is quite easy if you know the dimensions), but that doesn't add much overhead.

Although there are doubts about how good an idea it was, the C++ standard library special cases std::vector<bool> to really be a bit vector. That makes it impossible to create a native pointer to a single element of the vector (which is why many people consider std::vector<bool> to be a hack), and creates some other odd issues when you try to use it as a vector. But if all you want is a bitmask -- as in the case of a visited map -- then it is a pretty good solution.

C++ also offers real bit vectors -- std::bitset -- but unfortunately these need to have their size known at compile time. Boost offers dynamic_bitset, which is a kind of std::vector<bool> written with hindsight, so it's also worth looking at.

rici
  • 234,347
  • 28
  • 237
  • 341