31

Consider the following code:

unordered_set<T> S = ...;

for (const auto& x : S)
   if (...)
       S.insert(...);

This is broken, correct? If we insert something into S, the iterators may be invalidated (due to a rehash), which will break the range-for, because under the hood it uses S.begin() ... S.end().

Is there some pattern to deal with this?

One way is:

unordered_set<T> S = ...;

vector<T> S2;

for (const auto& x : S)
   if (...)
       S2.emplace_back(...);

for (auto& x : S2)
    S.insert(move(x));

This seems clunky. Is there a better way I'm missing?

(Specifically, if I were using a hand-rolled hash table and could block it from rehashing until the end of the loop, it would be safe to use the first version.)

Update:

From http://en.cppreference.com/w/cpp/container/unordered_map/insert

If rehashing occurs due to the insertion, all iterators are invalidated. Otherwise iterators are not affected. References are not invalidated. Rehashing occurs only if the new number of elements is higher than max_load_factor() * bucket_count().

Could you mess with max_load_factor somehow to prevent rehashing?

David G
  • 94,763
  • 41
  • 167
  • 253
Andrew Tomazos
  • 66,139
  • 40
  • 186
  • 319
  • 1
    I believe you might be right, but you can use the `iterator` insert overload to get rid of the second loop. `s.insert(x.begin(), x.end());` (you can use `std::make_move_iterator` if T is not POD). – 111111 Dec 20 '12 at 23:01
  • 1
    `s.insert(std::make_move_iterator(x.begin()), std::make_move_iterator(x.end()));` more verbose but I still prefer it – 111111 Dec 20 '12 at 23:02
  • @Andrew & 111111: I don't know how much rep is needed to see deleted answers, but in case you cannot: you're right! I do think the use of range-based for loop here is a red herring though. – GManNickG Dec 20 '12 at 23:18
  • @GManNickG: range-based for is basically equivalent to a normal for over begin...end iterator range: http://stackoverflow.com/questions/9657708/c11-the-range-based-for-statement-range-init-lifetime – Andrew Tomazos Dec 20 '12 at 23:20
  • @AndrewTomazos-Fathomling: [Right](http://stackoverflow.com/questions/2648878/range-based-for-statement-definition-redundancy), but as I mentioned you lose control over the iterator, which is pretty important to have when you're modifying the container (be it a `std::vector<>`, `std::set<>`, etc.), as you need to update the iterator to be valid again. Your question is more focused on the unique iteration properties of the container rather than the iteration method, I think. – GManNickG Dec 20 '12 at 23:25
  • @GManNickG: Sure, but even if you have control of the iterator, in this case it is lost (and irrecoverable) across a rehash. In fact the order of iteration is pseudo-randomly shuffled across a rehash if it's implemented in the normal way. My Q is specifically about unordered_set (and map). – Andrew Tomazos Dec 20 '12 at 23:32
  • @AndrewTomazos-Fathomling: Right, which is why I think the looping mechanism introduces a loss of control that's not related. :) But anyway, I see what you want now, hope that helps. EDIT: I give up! – GManNickG Dec 20 '12 at 23:51
  • Note your question body and the bottom question don't match. My second deleted answer shows how to prevent rehashing, but does not handle iterating. – GManNickG Dec 20 '12 at 23:52
  • @GManNickG: If rehashing is blocked then the first code in my question will work. – Andrew Tomazos Dec 21 '12 at 00:04
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/21487/discussion-between-gmannickg-and-andrew-tomazos-fathomling) – GManNickG Dec 21 '12 at 00:06

3 Answers

23

Could you mess with max_load_factor somehow to prevent rehashing?

Yes, you can set the max_load_factor() to infinity to ensure no rehashing occurs:

#include <iostream>
#include <limits>
#include <unordered_set>

int main()
{
    // initialize
    std::unordered_set<int> S;

    for (int i = 0; i < 8; ++i)
        S.insert(i);

    std::cout << "buckets: " << S.bucket_count() << std::endl;

    // infinite max load factor => never need to rehash
    const auto oldLoadFactor = S.max_load_factor();
    S.max_load_factor(std::numeric_limits<float>::infinity());

    for (const auto& x : S)
    {
        if (x > 2)
            S.insert(x * 2);
    }

    // restore load factor, verify same bucket count
    S.max_load_factor(oldLoadFactor);
    std::cout << "buckets: " << S.bucket_count() << std::endl;

    // now force rehash
    S.rehash(0);
    std::cout << "buckets: " << S.bucket_count() << std::endl;
}

Note that simply setting a new load factor does no rehashing, so those are cheap operations.

The rehash(0) bit works because it's a request that: 1) I get at least n buckets, and 2) I have enough buckets to satisfy my max_load_factor(). We just use zero to indicate we don't care for a minimum amount, we just want to rehash to satisfy our "new" factor, as if it was never changed to infinity.

Of course, this isn't exception-safe; if anything throws between the calls to max_load_factor(), our old factor is lost forever. This is easily fixed with your favorite scope-guard utility or a utility class.

Note that you get no guarantee that you'll iterate over the new elements. You will iterate over the existing elements, but you may or may not iterate over the new ones. If that is okay (which, per our chat, it should be), then this will work.

For example, suppose you iterate over an unordered set of integers and, for each even integer x, insert x * 2. If those always get inserted just after your current position (by chance of implementation detail and container state), you will never terminate the loop except through exceptions.

If you do need some guarantees, you need to go with an alternate storage solution.

GManNickG
  • 494,350
  • 52
  • 494
  • 543
  • Setting a large max_load_factor increases the chance of hash collisions, which significantly degrades performance. I'd rather reserve the number of buckets instead. It accomplishes the same goal, trading the performance hit for more memory usage. – qft Jun 17 '17 at 07:16
  • Is it guaranteed that you will iterate over ALL the existing elements (those present prior to the loop)? – 24n8 May 24 '20 at 08:08
6

Modifying any container while you're iterating over it tends to get hairy - even if it's a simpler structure than a hash, or even if you can prevent it from re-hashing, re-balancing or whatever.

Even if it did work, by the way, there's an ambiguity: should your newly-inserted members be iterated over or not? Is it OK to include them in this iteration only sometimes (i.e., only if they happen to end up after the current iterator)?

If you need to do this a lot, you could usefully wrap the container in a generic adapter that defers all the inserts until the end, but you're really finding a way to hide the code you already have.

Useless
  • 64,155
  • 6
  • 88
  • 132
2

I realize that it is conceptually the same as what you proposed, but I think it actually looks reasonably slick:

std::vector<T> tmp;
std::copy_if(S.begin(), S.end(), std::back_inserter(tmp),
             [](T const& value) { return ...; });
S.insert(std::make_move_iterator(tmp.begin()),
         std::make_move_iterator(tmp.end()));
Dietmar Kühl
  • 150,225
  • 13
  • 225
  • 380
  • :) Actually I think the simple imperative style is far more elegant, readable and more broadly understandable. My maintenance programmer would shoot me if he saw what you wrote here. :) But you are correct both will compile to the same code with unnecessary moves. – Andrew Tomazos Dec 20 '12 at 23:56
  • Actually your code has an extra move (lambda return to push_back(&&) rather than emplace), but it will most likely be elided by RVO. – Andrew Tomazos Dec 20 '12 at 23:59
  • 1
    I don't think I have an extra move because the lambda return is the Boolean value of the predicate, and the elements of the map won't be moved but will be copied anyway: since the types are identical there isn't any temporary being created. That said, I think a descriptive programming style is far superior. It may be unfamiliar but it actually expresses what the code does rather than leaving the reader wondering what the loops do. – Dietmar Kühl Dec 21 '12 at 00:03
  • Oh I see, in that case your code is wrong sorry. In my version the newly inserted item is an arbitrary expression. In your version you are adding some subset of S back to S which will always be a no-op and makes no sense. See how the evil functional style has confused you (and me). :) You should change to imperative. :) – Andrew Tomazos Dec 21 '12 at 00:08
  • No, my code expressed what I understood from your imperative programming! :-) ... but you are right this doesn't make sense and there is no `std::transform_if()` which would add the object based on the condition and transform it. – Dietmar Kühl Dec 21 '12 at 00:12