3

I'm using the yaml-cpp library to parse yaml. Abbreviated sample:

YAML::Node def = YAML::LoadFile(defFile);
for (auto itemPair = def.begin(); itemPair != def.end(); ++itemPair) {
    // Grab a reference so `itemPair->second` doesn't need to be copied all over the place
    auto& item = itemPair->second;

    // A few instances of the below in series
    if (item["key"].IsDefined()) { doSomething(item["key"].as<std::string>()); }

    // Problem happens here
    if (item["issue"].IsDefined()) {
        if (!item["issue"].IsMap()) { continue; }
        for (auto x = item["issue"].begin(); x != item["issue"].end(); ++x) {
            LOG(INFO) << "Type before: " << item.Type() << " : " << itemPair->second.Type();
            auto test = x->first.as<std::string>();
            LOG(INFO) << "Type after: " << item.Type() << " : " << itemPair->second.Type();
            // Using item as a map fails because it no longer is one!
            // Next loop attempt also crashes when it attempts to use [] on item.
        }
    }

The problem happens in the nested loop, where the reference taken at the beginning of the snippet suddenly changes, however the variable it's referencing seems to be unaffected:

I1218 12:44:04.697798 296012 main.cpp:123] Type before: 4 : 4
I1218 12:44:04.697813 296012 main.cpp:125] Type after: 2 : 4

My understanding of references is that they act as an alias for another variable. I understand the yaml library might be doing some magic behind the scenes which would change the underlying data, but I can't comprehend why the reference seems to be getting updated yet the original value remains.

Edit: Some serious mind-blowing behaviour is happening here. The reference gets "reset" back to the correct value after any call to itemPair->second.Type(). Thus if I add another log call:

LOG(INFO) << "Type after: " << item.Type() << " : " << itemPair->second.Type();
LOG(INFO) << "Type afterer: " << item.Type() << " : " << itemPair->second.Type();

The result:

I1218 12:58:59.965732 297648 main.cpp:123] Type before: 4 : 4
I1218 12:58:59.965752 297648 main.cpp:125] Type after: 2 : 4
I1218 12:58:59.965766 297648 main.cpp:126] Type afterer: 4 : 4

Reproducible example:

test.yaml:

---
one:
    key: x
    issue:
        first: 1
two:
    key: y
    issue:
        first: 1
        second: 2

main.cpp same as above but with hardcoded test.yaml, LOG swapped for std::cout, and the mock function:

#include <iostream>
#include <yaml-cpp/yaml.h>

void doSomething(std::string x) { std::cout << "Got key: " << x << std::endl; }

int main() {
    YAML::Node def = YAML::LoadFile("test.yaml");
    for (auto itemPair = def.begin(); itemPair != def.end(); ++itemPair) {
        // Grab a reference so `itemPair->second` doesn't need to be copied all over the place
        auto& item = itemPair->second;

        // A few instances of the below in series
        if (item["key"].IsDefined()) { doSomething(item["key"].as<std::string>()); }

        // Problem happens here
        if (item["issue"].IsDefined()) {
            if (!item["issue"].IsMap()) { continue; }
            for (auto x = item["issue"].begin(); x != item["issue"].end(); ++x) {
                std::cout << "Type before: " << item.Type() << " : " << itemPair->second.Type() << std::endl;
                auto test = x->first.as<std::string>();
                std::cout << "Type after: " << item.Type() << " : " << itemPair->second.Type() << std::endl;
                std::cout << "Type afterer: " << item.Type() << " : " << itemPair->second.Type() << std::endl;
                // Using item as a map fails because it no longer is one!
                // Next loop attempt also crashes when it attempts to use [] on item.
            }
        }
    }
}

Result:

$ ./build/out
Got key: x
Type before: 4 : 4
Type after: 2 : 4
Type afterer: 4 : 4
Got key: y
Type before: 4 : 4
Type after: 2 : 4
Type afterer: 4 : 4
Type before: 4 : 4
Type after: 2 : 4
Type afterer: 4 : 4
Daniel Langr
  • 22,196
  • 3
  • 50
  • 93
Vultour
  • 373
  • 4
  • 15
  • @john `auto& item = itemPair->second` – 463035818_is_not_an_ai Dec 18 '20 at 13:05
  • @largest_prime_is_463035818 OK I'll shut up now. – john Dec 18 '20 at 13:06
  • @John no worries, I was searching for it and only found it a second after your comment ;) – 463035818_is_not_an_ai Dec 18 '20 at 13:07
  • 6
    I don't know `YAML`, so I cannot help much, but this smells like UB, and a [mcve] is needed. Do you get that output with only this code? – 463035818_is_not_an_ai Dec 18 '20 at 13:09
  • The problem with yaml-cpp is that it seems to be very poorly documented. For example, `operator->` of nodes' `iterator` return some proxy class and there is not much said about what its `first` and `second` members are. Your `item` reference may be dangling but it's hard to find out. Generaly, I wouldn't recommend using such kind of a library. People complained about it quite a lot [here](https://stackoverflow.com/a/838428/580083). – Daniel Langr Dec 18 '20 at 13:53
  • 1
    It seems to be the most popular yaml parser for c++. I'd be happy to use something else but I think the choices are quite limited. – Vultour Dec 18 '20 at 13:56
  • 2
    Sanitizer tells me `AddressSanitizer: stack-use-after-scope` at `item["key"]`. That's odd. – KamilCuk Dec 18 '20 at 14:02
  • Thanks for telling me about the address sanitizer, seems like the reference is indeed causing issues. Could you explain why this can happen? As someone not well versed in C++ I was under the impression that references were an alias for another variable, I didn't know they could "diverge" this way. – Vultour Dec 18 '20 at 14:14
  • 2
    `YAML::Node` has reference semantics, so doing `auto item` instead of `auto& item` will not do unnecessary copying and will probably resolve the issue. – flyx Dec 18 '20 at 14:20
  • 1
    I have no idea. I can trigger the sanitizer with just `YAML::Node def = YAML::LoadFile("test.yaml"); auto& item = def.begin()->second; item["key"];`. Using `->second` directly or using `auto` instead of `auto&` seems to fix it. – KamilCuk Dec 18 '20 at 14:21
  • 2
    I am looking into the library source code and `operator->` applied to `iterator` seems to be creating a temporary object of type [`proxy`](https://github.com/jbeder/yaml-cpp/blob/master/include/yaml-cpp/node/detail/iterator.h#L31) (which copies its value) and then the `second` refers to its member variable ([inherited from `std::pair`](https://github.com/jbeder/yaml-cpp/blob/master/include/yaml-cpp/node/iterator.h#L20)). Storing a reference to this member variable does not prevent the temporary to be destroyed. Note that `operator*` of `iterator` also creates a copy of the stored value. – Daniel Langr Dec 18 '20 at 14:23

1 Answers1

1

Node is designed to hold a reference, and the iterator behaves like a pointer to std::pair<Node, Node> and will return a temporary Node. If you bind to that Node, you will bind to a destroyed Node. So you need a copy here. Change auto& to auto will solve the problem.

It is designed this way, because it doesn't want you to touch the underneath memory. Otherwise a possible dangling reference could be made when reallocating the memory.

An example for dangling reference:

std::vector<int> v{1};
auto &ref1 = v[0];

v.reserve(100); // reallocating, causing ref1 a dangling reference.

Also, I wrote why it is designed this way. See here: https://github.com/jbeder/yaml-cpp/issues/977#issuecomment-771041297 and I'll just copy it here.


Why reference is UB here.

When using ->, the iterator iter creates a temporary derefence result on the stack, returns its pointer, and destroys this object immediately after scope.

This is to make iter->second behaves alike the (*iter).second.

It is hard to decide when to destroy that object if you put that deref result on the heap.

The behavior is expected to be the same as (*iter).second. But (*iter).second is a rvalue, the compiler won't allow auto&. That's not the case in the iter->second, because the compiler thinks iter->second as a lvalue.

C++ standards makes p->m, the built-in member of pointer expression, a lvalue. So there's no way to forbid binding to reference.

In conclusion, the behavior is correct when

V list = iter->second;   // correct
V &list = iter->second;  // wrong
V &&list = iter->second; // COMPILE TIME ERROR
V &&list = std::move(iter->second); // still wrong

auto list = iter -> second;   // correct, list is V
auto &list = iter -> second;  // wrong,   list is V&
auto &&list = iter -> second; // wrong,   list is V&

V list = (*iter).second;   // correct
V &list = (*iter).second;  // COMPILE TIME ERROR
V &&list = (*iter).second; // correct

auto list = (*iter).second;   // correct, list is V
auto &list = (*iter).second;  // COMPILE TIME ERROR
auto &&list = (*iter).second; // correct, list is V&&

Here are some possible amendments for the author:

  1. Make detail::iterator_value object long living or just simply leak the memory.
  2. Remove operator->().
  3. Write into the document. Telling everyone to use auto.

Method 1 can cause much trouble. Method 2, 3 are good solutions, I think.

Why copy works like a reference here.

  1. In current design, every change went through a Node. Node is the public interface of the underlying memory. It is designed to become polymorphism. And the real type of the underlyding data is decided during running, which is unknown in the compile time. So auto& list = iter->second is impossible to bind to the correct underlying type.

This could be done with some efforts. It will be something like

auto& list = iter->second.data_as_ref<std::string>();

but still not that convenient to use.

  1. In the current design, you can get a copy by means of
auto list = iter->second.as<std::string>();

You cannot bind to it. It only allows you to copy, not write.

  1. That's because Node ensures you to use his interface to assign the value. It's quite important because assigning a data means 3 or more things to do.
  • If new data is one of the following type, it will encode it. std::pair, std::array, std::list, std::vector, std::map, bool, Binary
  • It assigns the data.
  • It assigns the type, one member in the enum class NodeType.
  • It assigns the status, a bool isDefined.

When reading, it also need to decode if the data was encoded. So it shouldn't give you direct write/read access.

  1. Also your ref could be dangling because a memory reallocation is possible.

Copy works like a reference is a must in the current design.

Conclusion

Use auto iter = iter->first; or use (*iter).first.

Steven Sun
  • 141
  • 6