1

Here I specifically only care about the GCC compiler and run time code efficiency.

Consider the following code try me

#include <iostream>
#include <map>

char Find(const std::map<int, char>& map, int key) {
    auto iter = map.find(key);
    if (iter == map.end()) 
        return 'X';
    return iter->second;
}

char Find2(const std::map<int, char>& map, int key) {
    return map.find(key)->second;
}

int main()
{
    // part 1
    std::map<int, char> x{{0,'0'}, {4,'4'}};
    std::cout << Find(x, 3) << std::endl;
    std::cout << Find(x, 4) << std::endl;
    std::cout << (int)Find2(x, 3) << std::endl; // returns 0
    std::cout << Find2(x, 4) << std::endl;

    // part 2: Find2 is a shortcut
    std::map<int, char> y(x);
    y.end()->second = 'X';
    std::cout << Find2(y, 3) << std::endl;
    std::cout << Find2(y, 4) << std::endl;
}

The part 2 also works for a GCC compiler I tested in Godbolt, despite it uses the end() in a weird way.

In GCC, does map allocate a node std::pair to represent the end? Will it be changing when elements are added/deleted? This is related to how the map's end() is really implemented, and I am curious to know it.

As many people pointed out, the C++ standard defines it as UB if a the end() is dereferenced.

However, according to this answer, which GCC seems to have implemented the map in a way that the end() is pointing to a root node. With this I think setting the root node's value to X here seems to be a valid operation. Does this mean that the above code should work for GCC?

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
doraemon
  • 2,296
  • 1
  • 17
  • 36
  • 7
    `y.end()->second = 'X';` is undefined. `end` does not refer to an element. You shall not dereference it – 463035818_is_not_an_ai Apr 13 '23 at 08:54
  • 3
    Both "part 2" and `Find2` are broken: to deference `end()` is UB – Adriano Repetti Apr 13 '23 at 08:54
  • 3
    Code with undefined behaviour is not required to loudly break – Caleth Apr 13 '23 at 08:54
  • I wonder if a node is allocated (`new`-ed) to represent the `end()`. If yes, then it is not `UB`? – doraemon Apr 13 '23 at 08:57
  • 1
    using a sentinel is rarely a good solution. `char Find` could trhow an exception, or it could return `std::optional` or it could return a `std::pair` – 463035818_is_not_an_ai Apr 13 '23 at 08:58
  • @doraemon According to this answer , GNU's libstdc++ and LLVM's libc++ don't allocate an end node (it's a `nullptr`), but MSVC's STL does. That doesn't make this a valid usage (I think STL throws instead of UB) – Artyer Apr 13 '23 at 09:00
  • 1
    if you write code without ub, then it works everywhere with any compiler. You shouldn't write code that relies on implementation details, because they can change – 463035818_is_not_an_ai Apr 13 '23 at 09:02
  • @463035818_is_not_a_number I'd argue that `.end()` *is* a sentinel … – Konrad Rudolph Apr 13 '23 at 09:02
  • 1
    @KonradRudolph it is, though a sentinel iterator, op wants a sentinel value. An iterator can have a unique value, but any `char` can also appear in the map, hence is not good as sentinel – 463035818_is_not_an_ai Apr 13 '23 at 09:04
  • the above code runs for linux with gcc. Does it already mean it should not be `nullptr`? I just wanna to understand why it is UB, or why the presence of the sentinel node still makes it UB. If the sentinel node really exists in memory (heap or stack), why cannot I just use it. What is the sentinel node's address? – doraemon Apr 13 '23 at 09:10
  • 2
    [the end element acts as a placeholder; attempting to access it results in undefined behavior](https://en.cppreference.com/w/cpp/container/map/end) – teapot418 Apr 13 '23 at 09:16
  • 2
    If you want to language-lawyer this, then there is a "The library never assumes that past-the-end values are dereferenceable." in `iterator.requirements.general` in the C++ standard. – teapot418 Apr 13 '23 at 09:21
  • 4
    Quoting [cppreference.com's entry on map::end()](https://en.cppreference.com/w/cpp/container/map/end) : "Returns an iterator to the element following the last element of the map. This element acts as a placeholder; attempting to access it results in undefined behavior.". – Jesper Juhl Apr 13 '23 at 10:03
  • 1
    "GCC seems to have implemented the map in a way that the end() is pointing to a root node. With this I think setting the root node's value to X here seems to be a valid operation." - You can't rely on that. The next minor library version may (validly) break your assumption. If you write the code like that it is simply buggy. – Jesper Juhl Apr 15 '23 at 10:39
  • If you are worried about efficiency, you should be nowhere near an std::map. – bitmask Apr 21 '23 at 13:30

1 Answers1

4

Since I had read several GCC implementations of STL containers, I have a fair understanding of how the end() sentinel is implemented. Since the question refers to how end() is implemented for the std::map container, I will deal with it, although the idea applies to almost all STL node-based containers.

The majority of the std::map interface, like the other associative containers, is a wrapper around a _Rb_tree object, that is implemented in the stl_tree.h file and partly in the tree.tcc file. And it is precisely in the header file that the implementation of the base node, _Rb_tree_node_base, and the node,_Rb_tree_node, can be found. The node is a derived class of the base node, to which the storage member is added. To be precise, since C++11 the storage member is no more of type T, but is defined with the GCC version of the std::aligned_storage type.

The trick is to use an instance of the base node as a sentinel node, that represents both the past-the-last element - by definition, the sentinel node is a specifically designed node that is used as a path traversal terminator - and the before-the-first element. Essentially, the sentinel node is interpreted as both the element after the last element and the element before the first element. The GCC implementation has designed the std::map's sentinel node in order to keep track of the root element, the leftmost element and the rightmost element using its pointers to the parent, the left child and the right child, respectively.

The original implementation is the following:

_Base_ptr&
       _M_root() _GLIBCXX_NOEXCEPT
       { return this->_M_impl._M_header._M_parent; }

_Base_ptr&
       _M_leftmost() _GLIBCXX_NOEXCEPT
       { return this->_M_impl._M_header._M_left; }

_Base_ptr&
      _M_rightmost() _GLIBCXX_NOEXCEPT
      { return this->_M_impl._M_header._M_right; }

The _M_header member is then the sentinel node. There are several advantages to using this approach for node-based containers, but std::forward_list, and the main benefits are the following two:

  • since the end() iterator points directly to the sentinel node, it is not a dangling pointer and do not exhibits a particular behavior even if the container is empty - in this case, both begin() and end() refer to the same thing, the sentinel node;
  • since the sentinel node is used to represent both the past-the-last element and the before-the-first element, there will always be a valid node before the beginning and after the end and then insert, erase and splice operations can be performed at any position in the same way.

A nice side effect of this implementation is that if the iterator end() is incremented, it will land at the beginning of the container. As mentioned before, the std::forward_list container does not have the same features of the other node-based containers, because its sentinel node improperly represents only the before-the-first element, pointed by the before_begin() iterator, and the std::nullptr is used to represent the past-the-last element. Then, if the before_begin() iterator is incremented, it will land at the beginning of the sequence, but if the end() iterator is incremented, it will lead to UB.

For all other node-based containers, such as std::list, the following code is always valid:

std::list<int> l{0, 1, 2, 3, 4, 5, 6};
auto it{l.cend()};
if(++it == l.cbegin())
  std::cout << "Wow!\n";

You may have guessed from this that trying to dereference the end() iterator is UB.

LoS
  • 448
  • 1
  • 3
  • 15
  • Nice writeup, but it would greatly benefit from breaking the long paragraphs up into several easier to read paragraphs. That's just a nit, not a knock. – David C. Rankin Apr 21 '23 at 00:59
  • Thanks very much. I upvoted you answer. But I still have some questions for you to clarify: what is the root/leftmost/base/m_header node of the tree and parent/left of a node. I didn't see these in [this ref](https://stackoverflow.com/a/3487108/7026980). what is the end() related to the root/leftmost? I would appreciate it a lot if you can provide a graph for them (text based will do). Do you mean that base == root == leftmost when you said: "use the base node to represent the sentinel of the tree, i.e. the reference to both the root node and the leftmost node in the tree." – doraemon Apr 21 '23 at 05:51
  • the usage of "base node" and "node base" is also confusion. do they mean `_Rb_tree_node_base`? Or is `base node` a node of type `_Rb_tree_node_base`? "The _M_header member is just an instance of the base node" seems to indicate base node is a type. – doraemon Apr 21 '23 at 05:58
  • @doraemon, yes, base node is always referred to `_Rb_tree_node_base`. And yes, `_M_header` is an instance of the base node, that is simply a type. – LoS Apr 21 '23 at 19:44
  • @doraemon, the base node is not the root of the tree, but an almost-node (it would be a node if it had the `_M_storage` type) that is used as a sentinel, a flag, whose `_M_left` pointer always refers to the root node and `_M_parent` pointer always refer to the lefmost node. However, if the binary search tree is empty, the expression base == root == leftmost becomes true, since the pointers must not remain dangling (then, if no nodes are present, the sentinel ends to refer to itself). – LoS Apr 21 '23 at 19:51