
I'd like to extend this SO question to treat a non-trivial use-case.

Background: pyyaml is pretty sweet insofar as it eats YAML and poops Python-native data structures. But what if you want to find a specific node in the YAML? The referenced question would suggest that, hey, you just know where in the data structure the node lives and index right into it. In fact pretty much every answer to every pyyaml question on SO seems to give this same advice.

But what if you don't know where the node lives in the YAML in advance?

If I were working with XML I'd solve this problem with an xml.etree.ElementTree. These provide nice facilities for loading an XML document into memory and finding elements based on certain search criteria. See find() and findall().

Questions:

  1. Does pyyaml provide search capabilities analogous to ElementTree? (If yes, feel free to yell at me for being bad at Google.)
  2. If no, does anyone have a nice recipe for extending pyyaml to achieve similar things? (Bonus points for not traversing the deserialized YAML all over again.)

Note that one important thing that ElementTree provides in addition to just being able to find things is the ability to modify the XML document given an element reference. I'd like to be able to do this on YAML as well.

BrianTheLion

2 Answers


Do you know how to search through Python objects? Then you know how to search through the results of a yaml.load()...

YAML is different from XML in two important ways: one is that while every element in XML has a tag and a value, in YAML, there can be some things that are only values. But secondly... again, YAML creates python objects. There is no intermediate in-memory format to use.

E.G. if you load a YAML file like this:

- First
- Second
- Third

you'll get a list like ['First', 'Second', 'Third']. Want to find 'Third' and don't know where it is? You can use [x for x in my_list if 'Third' in x] to find it. Need to look up an item in a dictionary? Just do it.
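For nested data where the location is not known in advance, the same idea can be applied recursively. The following is only a sketch: `find_paths` is a hypothetical helper (not part of PyYAML), the `data` value stands in for whatever yaml.safe_load() returned, and it assumes the structure contains only dicts, lists and scalars with no self-references.

```python
def find_paths(node, predicate, path=()):
    """Yield the index path of every value for which predicate(value) is true."""
    if predicate(node):
        yield path
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, predicate, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, predicate, path + (index,))


# Stand-in for the result of yaml.safe_load() on some nested document.
data = {"servers": [{"name": "alpha", "port": 80}, {"name": "beta", "port": 8080}]}

hits = list(find_paths(data, lambda v: v == "beta"))
print(hits)  # [('servers', 1, 'name')]
```

Each path can then be followed key by key to reach (and modify) the matching value.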

If you want to modify an object, you don't modify the YAML, you modify the object. E.g. now I want the second entry to be in German. I just do my_list[1] = 'zweite', modifying it in place. Now the Python list looks like ['First', 'zweite', 'Third'], and dumping it to YAML looks like

- First
- zweite
- Third

Note that PyYAML is pretty smart... you can even create objects with loops:

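>>> import yaml
>>> a = [1, 2, 3]
>>> b = {}
>>> b[1] = a
>>> b[2] = a  # same list object as b[1]
>>> # default_flow_style=None keeps the compact output of older PyYAML releases
>>> print(yaml.dump(b, default_flow_style=None), end="")
1: &id001 [1, 2, 3]
2: *id001
>>> b[2] = [3, 4, 5]  # a separate list, so nothing is shared any more
>>> print(yaml.dump(b, default_flow_style=None), end="")
1: [1, 2, 3]
2: [3, 4, 5]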

In the first case, it even figured out that b[1] and b[2] point to the same object, so it emitted an anchor (&id001) on the first and an alias (*id001) pointing back to it on the second. In the original object, if you did something like a.pop(), both b[1] and b[2] would show that one entry was gone. If you dump that object to YAML and then load it back in, that will still be true.

(And note that in the second one, where they aren't the same object, PyYAML doesn't create the extra notation, as it doesn't need to.)
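A quick way to check that the sharing survives a dump and a load is sketched below, assuming PyYAML is installed. It uses safe_dump and safe_load, which is enough here because only lists, dicts and integers are involved.

```python
import yaml

a = [1, 2, 3]
b = {1: a, 2: a}                       # both keys refer to the same list

loaded = yaml.safe_load(yaml.safe_dump(b))

shared = loaded[1] is loaded[2]        # the alias restores the shared object
print(shared)
loaded[1].pop()                        # remove one entry through the first key...
print(loaded[2])                       # ...and the second key sees it too
```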

In short: Most likely, you're just overthinking it.

Corley Brigman
  • You seem to be assuming that all objects you load from a YAML file are always lists or dicts, and not more complex objects for which iteration might not be defined. So your assumption only holds if you safeload YAML and throw away the most interesting details of the YAML file (which objects to load) while doing so. – Anthon Sep 22 '15 at 19:07
  • it seems like you'd have to know _something_ about the format before you load it, though... it's not any different than calling some other unknown API and getting an object of some unknown format and depth. In general, like you said, python objects can represent a much richer and more extensive set of associations than what can be represented in an XML... all XML values are strings, for instance, whereas YAML values can be anything. – Corley Brigman Sep 22 '15 at 19:37
  • XML has somewhat more structure: simplified, it consists of tags with attributes and nested tags (with attributes), and that makes it possible to ask things like "give me a tag X with attribute Y and a child Z somewhere in its descendants with attribute W". And that works: it either finds something or stops without finding anything. In YAML, because of self-reference, this might never stop. In addition you have less structural regularity (not just tags, but mappings **and** sequences). But what really complicates things is that !objecttyping can make {a: b} mean something different depending on context. – Anthon Sep 22 '15 at 19:45
  • This is great discussion! Thanks! – BrianTheLion Sep 22 '15 at 20:17

The answer to question 1 is: no. PyYAML implements the YAML 1.1 language standard and there is nothing about finding scalars by any path in the standard nor in the library.

However, if you safeload a YAML structure, everything is either a mapping, a sequence or a scalar. Even such a simplistic representation (simple compared to full-fledged object instantiation with !typemarkers) can already contain recursive, self-referencing structures:

&a {x: *a}

This is not possible in XML without external semantic interpretation. This makes writing a generic tree walker much harder in YAML than in XML. The type loading mechanism of YAML also makes it much more difficult to write a generic tree walker, even if you exclude the problem of self-references.
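For the safeloaded case, the self-reference problem can at least be contained by remembering which containers have already been visited. The sketch below is one way to do that, not a generic solution: `walk` is a hypothetical helper, it only understands dicts and lists, and the looping dict is built by hand to mimic what a mapping aliased to itself loads as.

```python
def walk(node, seen=None):
    """Yield every scalar value reachable from node, visiting each container once."""
    if seen is None:
        seen = set()
    if isinstance(node, (dict, list)):
        if id(node) in seen:           # already visited: an alias or a cycle
            return
        seen.add(id(node))
        children = node.values() if isinstance(node, dict) else node
        for child in children:
            yield from walk(child, seen)
    else:
        yield node


# Hand-built equivalent of a mapping whose value refers back to the mapping itself.
loop = {"name": "root"}
loop["self"] = loop

scalars = list(walk(loop))
print(scalars)  # ['root']
```

Without the `seen` set, the recursion on `loop` would never terminate.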

If you don't know where a node lives in advance, you still need to know how to identify the node, and since you don't know how you would walk the parent (which might be represented in multiple layers of combined mappings and sequences), it is almost useless to have a generic mechanism that depends on context.

Without being able to rely on context (in general) the thing that is left is a uniquely identifiable value (like the HTML id attribute). If all your objects in YAML have such a unique id, then it is possible to search the (safeloaded) tree for such an id value and extract any structure underneath it (mappings, sequences) until you hit a leaf (scalar), or some structure that has an id of its own (another object).
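A minimal sketch of such an id-based search follows. Everything in it is an assumption made for illustration: `find_by_id` is a hypothetical helper, the key is assumed to be called "id", and `tree` stands in for a safeloaded document.

```python
def find_by_id(node, wanted, key="id", _seen=None):
    """Return the first mapping whose `key` entry equals `wanted`, else None."""
    if _seen is None:
        _seen = set()
    if not isinstance(node, (dict, list)) or id(node) in _seen:
        return None                    # scalar, or a container already visited
    _seen.add(id(node))
    if isinstance(node, dict) and node.get(key) == wanted:
        return node
    children = node.values() if isinstance(node, dict) else node
    for child in children:
        hit = find_by_id(child, wanted, key, _seen)
        if hit is not None:
            return hit
    return None


# Stand-in for safeloaded YAML in which every object carries a unique "id".
tree = {"id": "root", "items": [{"id": "n1", "value": 1}, {"id": "n2", "value": 2}]}

node = find_by_id(tree, "n2")
node["value"] = 20                     # the hit is the live object, so the edit sticks
print(tree["items"][1])                # {'id': 'n2', 'value': 20}
```

Because the function returns the mapping itself rather than a copy, modifying the result modifies the loaded structure, which can then be dumped again.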

I have been following the YAML development for quite some time now (the earliest emails from the YAML mailing list that I have in my YAML folder are from 2004) and I have not seen anything generic evolve since then. I do have some tools to walk the trees and find things, which I use for extracting parts of the simplified structure for testing my ruamel.yaml library, but no code that is in a releasable shape (it would already have been on PyPI if it were), and nothing near to a generic solution like you can make for XML (which is IMO, on its own, syntactically less complex than YAML).

Anthon
  • @CorleyBrigman I never understood why you would make something human-readable and then throw away the comments (and the key ordering that is implicit in a document). I also recently published [PON](https://pypi.python.org/pypi/pon) to PyPI. It uses Python's internal parser for a config file format that is an alternative to YAML/JSON/INI. Of course with round-trip and comment preservation. – Anthon Sep 24 '15 at 07:34