16

We need to parse YAML files that contain duplicate keys, and every occurrence has to be kept; skipping the duplicates is not enough. I know this is against the YAML spec and I would rather not do it, but a third-party tool we use allows such files and we have to deal with them.

File example:

build:
  step: 'step1'

build:
  step: 'step2'

After parsing we should have a similar data structure to this:

yaml.load('file.yml')
# [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

dict can no longer be used to represent the parsed contents.
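
To make the problem concrete, here is a small sketch in plain Python (no YAML library involved) of why a dict cannot hold the desired result while a list of pairs can. The variable names are made up for illustration only.

```python
# The structure we want: an ordered list of (key, value) pairs.
pairs = [
    ('build', [('step', 'step1')]),
    ('build', [('step', 'step2')]),
]

# Converting to a dict keeps only the last value seen for each key.
as_dict = dict(pairs)

print(len(pairs))  # 2
print(as_dict)     # {'build': [('step', 'step2')]}
```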

I am looking for a solution in Python, but I haven't found a library that supports this. Have I missed anything?

Alternatively, I am happy to write my own, but I would like to keep it as simple as possible. ruamel.yaml looks like the most advanced YAML parser in Python and seems moderately extensible. Can it be extended to support duplicate keys?

jakubka
  • I need to have the YAML with duplicate keys parsed, not just recognise that there are duplicate keys. Unless I am missing something, the links you provided won't do that? – jakubka Jul 04 '17 at 12:00
  • Can you tell us what 3rd party tool generates such YAML? (YUNK?) – Anthon Jul 04 '17 at 13:25
  • @Anthon the tool we use is Drone CI and it doesn't generate it, but merely accepts it as valid input. It basically ignores the key names and only cares about the content and order. We are building some analysis tooling over files we feed to Drone CI and thus we need to be able to parse the files. – jakubka Jul 04 '17 at 22:17

4 Answers

13

PyYAML will just silently overwrite the first entry; ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.

If you don't want to create a full Constructor for all types, overriding the mapping constructor in SafeConstructor should do the job:

import sys
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""


def construct_yaml_map(self, node):
    # collect (key, value) pairs in a list, so that duplicate keys are kept
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))


SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

which gives:

[('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

However, it doesn't seem necessary to turn step: 'step1' into a list. The following creates the list only if there are duplicate keys (it could be optimised, if necessary, by caching the results of self.construct_object(key_node, deep=True)):

def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    keys = set()
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        if key in keys:
            break
        keys.add(key)
    else:
        data = {}  # type: Dict[Any, Any]
        yield data
        value = self.construct_mapping(node)
        data.update(value)
        return
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))

which gives:

[('build', {'step': 'step1'}), ('build', {'step': 'step2'})]
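
Since this second variant returns a dict for mappings without duplicates and a list of pairs for mappings with duplicates, code that consumes the result has to cope with both shapes. A minimal sketch of one way to do that follows; iter_items is a hypothetical helper, not part of ruamel.yaml, and it assumes that 2-tuples only ever come from the patched constructor (YAML sequences load as plain lists with the safe loader).

```python
def iter_items(node):
    """Yield (key, value) pairs from a dict or from a list of 2-tuples.

    Hypothetical helper: anything else (scalars, ordinary lists) yields nothing.
    """
    if isinstance(node, dict):
        yield from node.items()
    elif isinstance(node, list) and all(
        isinstance(item, tuple) and len(item) == 2 for item in node
    ):
        yield from node


# The output shown above, written out as a literal.
data = [('build', {'step': 'step1'}), ('build', {'step': 'step2'})]

steps = [value['step'] for key, value in iter_items(data) if key == 'build']
print(steps)  # ['step1', 'step2']
```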

Some points:

  • Probably needless to say, this will not work with YAML merge keys (<<: *xyz)
  • If you need ruamel.yaml's round-trip capabilities (yaml = YAML()), that will require a more complex construct_yaml_map.
  • If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure):

    yaml_out = YAML(typ='safe')
    yaml_out.dump(data, sys.stdout)
    

    which gives (with the first construct_yaml_map):

    - - build
      - - [step, step1]
    - - build
      - - [step, step2]
    
  • What works in neither PyYAML nor ruamel.yaml is yaml.load('file.yml'), because a string argument is parsed as YAML content rather than treated as a filename. If you don't want to open() the file yourself you can do:

    from pathlib import Path  # or: from ruamel.std.pathlib import Path
    yaml = YAML(typ='safe')
    yaml.load(Path('file.yml'))
    

¹ Disclaimer: I am the author of that package.

Anthon
  • Mind blown! That's much more elegant than I thought it'd be. Thanks a lot for the code and good explanation! Thankfully the limitations are fine to me. – jakubka Jul 04 '17 at 17:00
  • One question: in `construct_yaml_map`, is there an advantage to yielding the `data` array instead of just returning it once it's populated? – jakubka Jul 04 '17 at 17:01
  • @jakubka Yes the `yield` is essential part of [two-step generation](https://stackoverflow.com/questions/41900782/why-does-pyyaml-use-generators-to-construct-objects/41900996#41900996) necessary for self-referential structures (i.e. those using anchors and aliases) – Anthon Jul 04 '17 at 17:39
  • Makes sense, cheers. FYI I ended up using [multidict](https://pypi.python.org/pypi/multidict) to represent the file. – jakubka Jul 04 '17 at 22:28
  • I am trying to convert this to a round-trip function that can handle both duplicate and non-duplicate keys (i.e, make all second level nodes lists), but I cannot make it work. could you help me there? or just tell me how to get "node", I then can reverse-engineer. thanks – mluerig Nov 24 '19 at 14:08
  • @mluerig please post a new question with the code (and input file) that you have, even though that is not working tag it `ruamel.yaml` and I'll get notified that there is a new question – Anthon Nov 24 '19 at 19:57
  • @Anthon Do you have any idea how to limit this constructor to specific keys, e.g. "build"? I'd like to use this approach in my application, but don't want to affect the remaining data structure. Within `construct_yaml_map()` only the value seems to be available, not its key. – Falko Apr 05 '22 at 11:06
  • @Falko you can subclass the Constructor with just the method for constructing mappings changed, and include the code for checking the keys. But if it depends on context, that is going to be difficult and you are better off recursively traversing the data structure before dumping, as you have more control over keeping track of the context that you need to decide which keys are allowed or not. – Anthon Apr 05 '22 at 17:53
  • @Anthon Ah ok. I think I'll stick with a [more general approach](https://stackoverflow.com/a/71751051/3419103). This is rather easy to implement and only affects subtrees of the data structure with duplicate keys. – Falko Apr 06 '22 at 06:47
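
As a toy illustration of the two-step generation mentioned in the comments above: the constructor first yields the still-empty container, so that it can be referenced, and only fills it afterwards. The sketch below is plain Python and is not the actual PyYAML or ruamel.yaml code; construct_list, construct and build_items are hypothetical names.

```python
def construct_list(build_items):
    """Generator-style constructor: yield the empty container, then fill it."""
    data = []
    yield data                       # step 1: hand out the (still empty) object
    data.extend(build_items(data))   # step 2: fill it; items may refer to `data`


def construct(constructor, build_items):
    gen = constructor(build_items)
    obj = next(gen)   # the object exists now and could be registered for aliases
    for _ in gen:     # run the generator to completion to populate the object
        pass
    return obj


# A list whose second element is the list itself, similar to `&a [1, *a]`.
result = construct(construct_list, lambda self_ref: [1, self_ref])
print(result[1] is result)  # True
```

Had construct_list built and returned a finished list instead, there would be no object to point the self-reference at while the items were being created.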
6

You can override how PyYAML constructs mappings. For example, you could use a defaultdict that holds a list of values for each key:

from collections import defaultdict
import yaml


def parse_preserving_duplicates(src):
    # We deliberately define a fresh class inside the function,
    # because add_constructor is a class method and we don't want to
    # mutate pyyaml classes.
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        """Walk the mapping, recording any duplicate keys.

        """
        mapping = defaultdict(list)
        for key_node, value_node in node.value:
            key = loader.construct_object(key_node, deep=deep)
            value = loader.construct_object(value_node, deep=deep)

            mapping[key].append(value)

        return mapping

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)
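
To show what shape of data this grouping produces, here is the same defaultdict logic applied to a plain list of pairs, without PyYAML. This is only a sketch; group_pairs and the sample keys are made up for illustration.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group (key, value) pairs the same way map_constructor groups nodes."""
    mapping = defaultdict(list)
    for key, value in pairs:
        mapping[key].append(value)
    return mapping


pairs = [('build', 'a'), ('test', 'b'), ('build', 'c')]
grouped = group_pairs(pairs)

print(dict(grouped))  # {'build': ['a', 'c'], 'test': ['b']}
```

Note that every value ends up in a list, even for keys that occur only once, and that the relative order of entries with different keys is no longer recoverable.
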
Wilfred Hughes
2

If you can modify the input data very slightly, you should be able to do this by converting the single YAML-like file into multiple YAML documents. Several YAML documents can live in the same file if they are separated by --- on a line by itself, and your entries conveniently appear to be separated by a blank line (two consecutive newlines):

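import yaml

with open('file.yml', 'r') as f:
    data = f.read()

# Turn each blank-line separator into a YAML document separator.
data = data.replace('\n\n', '\n---\n')

# safe_load_all needs no explicit Loader argument.
for document in yaml.safe_load_all(data):
    print(document)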

Output:

{'build': {'step': 'step1'}}
{'build': {'step': 'step2'}}
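
If the entries may be separated by more than one blank line, or the file ends with blank lines, the plain replace would produce empty documents. A slightly more tolerant sketch using a regular expression follows; to_multi_document is a hypothetical helper, and it assumes that blank lines only ever occur between top-level entries (it would also split inside block scalars that contain blank lines).

```python
import re


def to_multi_document(text):
    """Replace each run of blank lines with a YAML document separator."""
    return re.sub(r'\n(?:[ \t]*\n)+', '\n---\n', text.strip() + '\n')


source = "build:\n  step: 'step1'\n\n\nbuild:\n  step: 'step2'\n\n"
print(to_multi_document(source))
```
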
Simon Fraser
  • This approach will only work if the duplicate keys are all in a mapping that is in the top-level. Why the comment `# should really use os.path.sep`, you are not doing anything with filenames? – Anthon Jul 04 '17 at 13:44
  • Fair point, I was basing it on the example given. And `os.path.sep` I blame on lack of caffeine ;) – Simon Fraser Jul 04 '17 at 13:58
  • As a quick fix it seems ok, just have to be aware of the limitations. I used to go for coffee to a place a few doors down from Heffers bookshop when I was around (back in the 80's), can't remember its name though. – Anthon Jul 04 '17 at 14:22
  • Good tip, but it won't work in my case unfortunately, as I have duplicates in the subsections. I should have made it clear in the example, sorry! – jakubka Jul 04 '17 at 16:05
1

Here is an alternative implementation based on Anthon's answer and ruamel.yaml. It is rather generic and uses lists for duplicates, while other entries are left unchanged.

from collections import Counter
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = '''
a: 1
b: 2
b: 2
'''

def construct_yaml_map(self, node):
    data = {}
    yield data
    # use distinct loop variable names to avoid shadowing the `node` argument
    keys = [self.construct_object(key_node, deep=True) for key_node, _ in node.value]
    vals = [self.construct_object(value_node, deep=True) for _, value_node in node.value]
    key_count = Counter(keys)
    for key, val in zip(keys, vals):
        if key_count[key] > 1:
            if key not in data:
                data[key] = []
            data[key].append(val)
        else:
            data[key] = val

SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

Output:

{'a': 1, 'b': [2, 2]}

The same is possible with the pyyaml package (inspired by Wilfred Hughes' answer):

from collections import Counter
import yaml

yaml_str = '''
a: 1
b: 2
b: 2
'''

def parse_preserving_duplicates(src):
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        # use distinct loop variable names to avoid shadowing the `node` argument
        keys = [loader.construct_object(key_node, deep=deep) for key_node, _ in node.value]
        vals = [loader.construct_object(value_node, deep=deep) for _, value_node in node.value]
        key_count = Counter(keys)
        data = {}
        for key, val in zip(keys, vals):
            if key_count[key] > 1:
                if key not in data:
                    data[key] = []
                data[key].append(val)
            else:
                data[key] = val
        return data

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)

print(parse_preserving_duplicates(yaml_str))

Output:

{'a': 1, 'b': [2, 2]}
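
With either variant, a value is a list only when its key was duplicated, so consuming code may want to normalise both cases. A minimal sketch, where as_list is a hypothetical helper:

```python
def as_list(value):
    """Return the value unchanged if it is a list, otherwise wrap it in one."""
    return value if isinstance(value, list) else [value]


data = {'a': 1, 'b': [2, 2]}

for key, value in data.items():
    print(key, as_list(value))
# a [1]
# b [2, 2]
```

One caveat of this representation: a key that occurs once with a YAML sequence as its value (b: [2, 2]) cannot be told apart from a key that occurs twice with scalar values.
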
Falko