2

Think of collections of different types like Position, Color, Name. Instances of those can be connected by using the same key in the collection. Keys are global unique identifiers of 64 bit length. Currently, I use a hash map but this is not ideal.

// custom types
struct Position { float x, y, z; bool static; };
enum Color { RED, BLUE, GREEN };

// collections
std::unordered_map<uint64_t, Position> positions;
std::unordered_map<uint64_t, Color> colors;
std::unordered_map<uint64_t, std::string> names;

// add some data
// ...

// only access positions collection,
// so the key is not needed here
for (auto i = positions.begin(); i != positions.end(); ++i) {
    if (i->second.static) continue;
    i->second.x = (rand() % 1000) / 1000.f;
    i->second.y = (rand() % 1000) / 1000.f;
    i->second.z = (rand() % 1000) / 1000.f;
}

// access to all three collections,
// so we need the key here
for (auto i = positions.begin(); i != positions.end(); ++i) {
    uint64_t id   = *i->first;
    auto position = i->second;
    auto color    = colors.find(id);
    auto name     = names.find(id);
    if (color == colors.end() || name == names.end()) continue;
    draw(*name, *position, *color);
}

I try to separate the collections, but as you can see, it also happens that collected instances of multiple collections are needed at the same time. Of course, I also need to add or delete single elements from time to time, but these cases are not performance critical.

Now I'd like to optimize iterating over individual collections. Therefore, I try to get the collections contiguously stored, which is part of the idea of data oriented design. However, I still need to access individual instances very quickly. Directly using an array doesn't work since that would allocate way too much memory and not all instances of one type have counterparts of another type. On the other hand hash maps are not optimal for iteration.

I think the data structure have to internally use an array. Which data type should I use here? And is there such a data structure implemented in the C++ standard library?

Deduplicator
  • 44,692
  • 7
  • 66
  • 118
danijar
  • 32,406
  • 45
  • 166
  • 297
  • There are only 2 hash table collection (that I know of) in the standard library - `unordered_map/set`. – Bernhard Barker Mar 09 '14 at 16:12
  • I was with you up to "collections memory aligned". Exactly what do you want to be aligned with what else, and exactly how do you plan to make use of this property (whatever it is)? – Igor Tandetnik Mar 09 '14 at 16:43
  • @IgorTandetnik The values should be memory aligned. For example, I think of an array that has wrappers to access elements by key using a hash table. However, I am not sure if this is the best data structure for my task. – danijar Mar 09 '14 at 17:47
  • All objects in a valid C++ program are memory aligned according to the alignment requirements of their type, whether stored in STL containers or otherwise. Under what circumstances do you expect to end up with an object that is **not** so aligned? I do not understand the nature of your concern. – Igor Tandetnik Mar 09 '14 at 17:59
  • 1
    By "memory aligned", do you, by any chance, mean "stored in a contiguous storage"? – Igor Tandetnik Mar 09 '14 at 18:02
  • @IgorTandetnik For arrays and `std::vector`s it is guaranteed that their values are stored next to each other and in defined order in RAM. In contrast, for `std::unordered_map`s their values may be spread over different locations in the application's memory, for example when you insert elements later on. Because of how CPU caching works, it is much faster to iterate over an continuous block of memory. Edit: My concern is about alignment of values of the collection, not about the alignment of members inside that values. – danijar Mar 09 '14 at 18:54
  • 1
    @danijar the short answer to Igor's question would then be "yes". Alignment is the wrong word: you mean "contiguously stored". – Yakk - Adam Nevraumont Mar 09 '14 at 20:57
  • @Yakk Alright, didn't know that. Sorry for the confusion. I updated the question accordingly. – danijar Mar 10 '14 at 08:14

1 Answers1

3

Create an unordered_map to offsets into a std::vector.

Store your elements in the std::vector. When you want to remove an element, swap it with the last element of the std::vector, and delete the last element. Go through the unordered_maps that store indexes into your std::vector and repair those that pointed to the last element to point to where the last element now is.

Removing an element is now O(n). If you remove a bunch of elements at once, you can do all of them in one O(n) pass with a little creativity.

Adding an element remains O(1).

Iterating over all of the elements involves iterating over the std::vector. If you need the hash value while iterating, store it redundantly there, or calculate it on the fly.

Wrap the std::vector and std::unordered_map behind a type that enforces the above rules, as otherwise the invariants will never persist, that exposes a map-like interface externally.

As an alternative to O(n) removal, create a parallel std::vector storing std::unordered_map<???>::iterators. Keep careful track of when a rehash occurs in the std::unordered_map (rehashing is deterministic, but you have to work out when it happens yourself), and when it does rebuild it (as all of the iterators will be invalidated by a rehash). Rehashes occur rarely, so this has an amortized constant cost1. When you want to erase, you can now update the index in the unordered_map in O(1) time (remember to update the second std::vector as well -- swapping position from last to the deleted element). You could store this in the same std::vector as the raw data, but that would bulk it up (which will slow down iteration).


1 Rehashes happen when the container passes reaches max_load_factor()*bucket_count() size. The bucket_count then grows exponentially, and the elements are moved around. Much like a std::vector's growth algorithm, this guarantees a linear-in-number-of-elements total elements moved -- so it maintains amortized constant insertion. The manual rebuild of your reverse table is no more expensive than the moving of all of the existing elements, so it also contributes an amortized constant insertion time factor.

Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524
  • 1
    Thanks a lot for the detailed explanation. I will go with the second approach, because it is easy to get the key inside a loop then. Because I use GUIDs generated by boost as keys, I provided a hashing function to the unordered_map that just returns the key truncated to size_t. Rehashing can't occur then, can it? Second, question, why is it better to store iterators instead of just the keys? – danijar Mar 10 '14 at 08:00
  • 1
    @danijar rehashing is when the `unordered_map` reallocates in order to grow. The iterators should be marginally faster to dereference that it is to do a lookup: but yes, doing the lookup is probably a better idea as you do not have the headache of tracki g rehases: my brain skipped a beat and missed that. Just do the lookup and fix the index. – Yakk - Adam Nevraumont Mar 10 '14 at 12:16
  • @Yakk Alright. Last question, is there a common name for this type of data structure? That would help me find further reading. – danijar Mar 10 '14 at 12:57
  • @danijar bobah claims it is an interner? I just made it up in response to your question, hence the brain skip over the reverse lookup/iterators. In theory, a `boost::multi_index` might be able to do this without doing it manually (but it might have accounting information in the `vector` equivalent, or use indirection -- profile profile profile!) – Yakk - Adam Nevraumont Mar 10 '14 at 13:51
  • @Yakk I want to implement it myself. But if it would've been a common data structure the might be some reference implmentations for me. – danijar Mar 10 '14 at 14:46