1

While messing around with type-punning iterators, I came across the ability to do

std::vector<int> vec{ 3, 7, 1, 8, 4 };
int* begin_i = (int*)(void*)&*vec.begin();

std::cout << "1st: " << begin_i << " = " << *begin_i << std::endl;
begin_i++;
std::cout << "2nd: " << begin_i << " = " << *begin_i << std::endl;

Then I tried to do the same kind of thing with an std::unordered_set:

std::unordered_set<int> set{ 3, 7, 1, 8, 4 };
for (auto& el : set)
{ // Display the order the set is currently in
    std::cout << el << ", ";
}
std::cout << '\n' <<std::endl;

int* begin_i = (int*)(void*)&*set.begin();

std::cout << "1st: " << begin_i << " = " << *begin_i << std::endl;
begin_i++;
std::cout << "2nd: " << begin_i << " = " << *begin_i << std::endl;

But the output I got was:

4, 8, 1, 7, 3,

1st: [address] = 4
2nd: [address] = 0

I'm supposing this is because and an unordered set's elements are located in different parts of memory? I was confused here considering that I also printed the order the elements were stored in using a range-based loop.

My question is how does an std::unordered_set store its elements in memory? What happens when an element is added to the set? Where does it go in memory and how is that kept track of if it's not stored in an array-like container where the elements are one-right-after-the-other?

Drake Johnson
  • 640
  • 3
  • 19
  • Internally, the elements in the unordered_set are not sorted in any particular order, but organized into buckets depending on their hash values to allow for fast access to individual elements directly by their values (with a constant average time complexity on average). http://www.cplusplus.com/reference/unordered_set/unordered_set/ – Frederick Zhang Dec 12 '19 at 01:08
  • 3
    Starting in C++17, the elements of an unordered set are required to be stored in [things called "nodes"](https://en.cppreference.com/w/cpp/container/node_handle). The standard doesn't impose requirements on what a node is, but most implementations will treat it as a node in a linked list. Range-based loops use the container iterator, and the set iterator walks through the elements of the set using whatever mechanism is requited by the implementation. – Raymond Chen Dec 12 '19 at 01:11
  • @FrederickZhang Thank you. I've read the page you liked before, but I still don't quite understand. What exactly are "buckets"? How do they speed anything up? What does this look like in memory? – Drake Johnson Dec 12 '19 at 01:12
  • 1
    A detailed discussion of the design of an unordered set is too broad for this site. You can look in a data structures book for details of how hash tables are typically implemented. An unordered set is a special case of a hash table with no associated value. – Raymond Chen Dec 12 '19 at 01:15
  • 2
    The not really helpful answer would be that it's the choice of the implementation how that data is stored. But the general idea of `std::unordered_*` is to use a hash table under the hood, allowing for constant time lookup. As such, the corresponding iterators must contain some brains to allow for finding the next element; **an iterator for `std::unordered_*` is generally *not* a pointer**. When you do `&*set.begin()`, you are calling the custom `operator *` that the smart iterator implements, which does yield a pointer to the element. But the next element can be anywhere else. – cmaster - reinstate monica Dec 12 '19 at 01:15

3 Answers3

5

An unordered_set is implemented as a hash table using external chaining.

That basically means you have an array of linked lists (which are usually called "buckets"). So, to add an item to an unordered_set you start by hashing the new item you're doing to insert. You then take that hash and reduce it to the range of the current size of the array (which can/will expand as you add more items). You then add the new item at the tail of that linked list.

So, depending on the value produced by the hash, two consecutively inserted items may (and often will) be inserted in the linked lists at completely different parts of the table. Then the node in the linked list will typically be dynamically allocated, so even two consecutive items in the same linked list may be at completely unrelated addresses.

As I noted in an earlier answer, however, quite a bit more about this is actually specified in the standard than most people seem to realize. As I outlined there, it might be (barely) possible to violate the expectation and still (sort of) meet the requirements in the standard, but even at best, doing so would be quite difficult. For most practical purposes, you can assume it's something quite a bit like a vector of linked lists.

Most of the same things apply to an unordered_multiset--the only fundamental difference is that you can have multiple items with the same key instead of only one item with a particular key.

Likewise, there are also unordered_map and unordered_multimap, which are pretty similar again, except that they separate the things being stored into a key and a value associated with that key, and when they do hashing, the only look at the key part, not the value part).

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
1

Rather than directly answer the question, I would like to address the "type-punning" trick. (I put that in quotes because the provided code does not demonstrate type-punning. Perhaps the code was appropriately simplified for this question. In any event, *vec.begin() gives an int, so &*vec.begin() is an int*. Further casting to void* then back to int* is a net no-op.)

The property your code takes advantage of is

*(begin_i       + 1) == *(vec.begin() + 1)  // Using the initial value of begin_i
*(&*vec.begin() + 1) == *(vec.begin() + 1)  // Without using an intermediary

This is a property of a contiguous iterator, which is associated with a contiguous container. These are the containers that store their elements in adjacent memory locations. The contiguous containers in the standard library are string, array, and vector; these are the only standard containers for which your trick is guaranteed to work. Trying it on a deque will probably seem to work at first, but the attempt will fail if enough is added to &*begin(). Other containers tend to dynamically allocate elements individually, so there need not be any relation between the addresses of elements; elements are linked together by pointers rather than by position/index.


So that I'm not ignoring the asked question:

An unordered set is merely required to organize elements into buckets. There are no requirements on how this is done, other than requiring that all elements with the same hash value are placed in the same bucket. (This does not imply that all elements in the same bucket have the same hash value.) In practice, each bucket is probably implemented as a list, and the container of buckets is probably a vector, simply because re-using code is cool. At the same time, this is an implementation detail, so it can very from compiler to compiler, and even from compiler version to compiler version. There are no guarantees.

JaMiT
  • 14,422
  • 4
  • 15
  • 31
-1

The way std::unordered_set stores its memory is implementation defined. Standart doesn't care as long as it satisfies the requirements.

In VS version it stores them inside an std::list (fast access is provided by creating and managing additional data) - so each element has also pointers towards prev and next is stored via new (at least that's what I remember from std::list).

ALX23z
  • 4,456
  • 1
  • 11
  • 18
  • Starting in C++17, the elements of an unordered set must be stored in "nodes" (which are returned by the `extract` method). – Raymond Chen Dec 12 '19 at 01:12
  • @RaymondChen the problem is that the type is unspecified. You can get it but cannot use it in a portable way. What is the point of accessing them if it's like this? – ALX23z Dec 12 '19 at 01:21
  • You can insert a node into another collection. That's pretty much the only thing you can do with it, but it's something. – Raymond Chen Dec 12 '19 at 03:57