
I am working on getting all text that exists in several .yaml files placed into a new singular YAML file that will contain the English translations that someone can then translate into Spanish.

Each YAML file has a lot of nested text. I want to print the full 'path', aka all the keys, along with the value, for each value in the YAML file. Here's an example input for a .yaml file that lives in the myproject.section.more_information file:

default: 
    heading: Here’s A Title
    learn_more:
        title: Title of Thing
        url: www.url.com
        description: description
        opens_new_window: true

and here's the desired output:

myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description: description
myproject.section.more_information.default.learn_more.opens_new_window: true

This seems like a good candidate for recursion, so I've looked at examples such as this answer.

However, I want to preserve all of the keys that lead to a given value, not just the last key in a value. I'm currently using PyYAML to read/write YAML.

Any tips on how to save each key as I continue to check if the item is a dictionary and then return all the keys associated with each value?

swellactually

3 Answers


What you want to do is flatten nested dictionaries. This would be a good place to start: Flatten nested Python dictionaries, compressing keys

In fact, I think the code snippet in the top answer would work for you if you just changed the sep argument to '.'.

edit:

Check this for a working example based on the linked SO answer http://ideone.com/Sx625B

from collections.abc import MutableMapping

some_dict = {
    'default': {
        'heading': 'Here’s A Title',
        'learn_more': {
            'title': 'Title of Thing',
            'url': 'www.url.com',
            'description': 'description',
            'opens_new_window': 'true'
        }
    }
}

def flatten(d, parent_key='', sep='_'):
    """Return a flat dict whose keys are the nested keys joined by sep."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # MutableMapping lives in collections.abc; the old
        # collections.MutableMapping alias was removed in Python 3.10.
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

results = flatten(some_dict, parent_key='', sep='.')
for item in results:
    # str() so that non-string values (e.g. booleans) can be printed too
    print(item + ': ' + str(results[item]))

If you want the output in the original order, you'll need an OrderedDict on Python versions before 3.7; from 3.7 on, plain dicts keep insertion order.
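If order matters, a generator that yields the pairs as it walks the tree avoids building an intermediate dict at all. This is only a sketch and not the code from the linked answer: the name iter_paths and the sample data are made up for illustration, and it assumes Python 3.7+ so that the dict iterates in insertion order.

```python
from collections.abc import Mapping


def iter_paths(node, prefix=''):
    """Yield (dotted_path, value) pairs in the order the keys appear."""
    for key, value in node.items():
        path = f'{prefix}.{key}' if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse, handing the path built so far down as the new prefix.
            yield from iter_paths(value, path)
        else:
            yield path, value


# Hypothetical sample data, shaped like the question's input.
sample = {
    'default': {
        'heading': 'A Title',
        'learn_more': {'title': 'Title of Thing', 'opens_new_window': True},
    }
}

for path, value in iter_paths(sample, 'myproject.section.more_information'):
    print(f'{path}: {value}')
```

Passing the file-derived prefix as the second argument gives the full path the question asks for.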

Michael
  • Super, thanks for sending. I took a look and the output looks great. Unfortunately I have to step away from my computer but I am going to spend some more time looking through it this weekend. – swellactually Jul 30 '16 at 01:01
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes, this has even happened with links to other StackOverflow answers. – Anthon Jul 30 '16 at 05:56
  • Gotcha @Anthon, I'll fix it. – Michael Jul 30 '16 at 06:11

Keep a simple list of strings holding the most recent key at each indentation depth. When the next line is at the same depth, simply replace the item at the end of the list. When you "out-dent", pop the last item off the list before replacing the new last item. When you indent, append to the list.

Then, each time you hit a colon, the corresponding key item is the concatenation of the strings in the list, something like:

'.'.join(key_list)

Does that get you moving at an honorable speed?
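The bookkeeping described above could be sketched roughly as follows. This is an illustration under strong assumptions, not a YAML parser: it only handles plain "key: value" lines with space indentation (no lists, quoting, flow style or multi-line scalars), every value stays a string, and the name paths_from_lines is hypothetical.

```python
def paths_from_lines(text, base=''):
    """Yield (dotted_path, value) for simple 'key: value' YAML-like lines."""
    stack = []  # (indent, key) pairs, one per currently open nesting level
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # skip blank lines and full-line comments
        indent = len(line) - len(line.lstrip())
        key, _, value = line.strip().partition(':')
        # Same depth or out-dent: drop keys that are no longer parents.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, key.strip()))
        value = value.strip()
        if value:  # a scalar on this line, so the path is complete
            parts = ([base] if base else []) + [k for _, k in stack]
            yield '.'.join(parts), value


text = """\
default:
    heading: A Title
    learn_more:
        title: Title of Thing
        url: www.url.com
"""

for path, value in paths_from_lines(text, 'myproject.section.more_information'):
    print(f'{path}: {value}')
```

For real input files, loading the YAML with a parser and recursing over the resulting dictionaries is more robust than tracking indentation by hand.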

Prune
  • Smart design. I have to step away from the computer so won't be able to try it now but I like the approach, thank you! – swellactually Jul 30 '16 at 01:09
  • Using a list of strings makes things overly complex. You have to use recursion anyway digging into the nested dictionaries, so just hand in the current path "prefix" and the recursion takes care of appending/popping that you suggest to do by hand. – Anthon Jul 30 '16 at 05:49

Walking over nested dictionaries begs for recursion, and handing in the "prefix" of the "path" saves you from having to do any manipulation on the segments of your path (as @Prune suggests).

There are a few things to keep in mind that makes this problem interesting:

  • because you are using multiple files, the same path can occur in more than one file, which you need to handle (at least by throwing an error, as otherwise you might silently lose data). In my example I generate a list of values.
  • dealing with special keys (non-string (convert?), empty string, keys containing a .). My example reports these and exits.

Example code using ruamel.yaml ¹:

import sys
import glob
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq
from ruamel.yaml.compat import string_types, ordereddict

class Flatten:
    def __init__(self, base):
        self._result = ordereddict() # key to list of tuples of (value, comment)
        self._base = base

    def add(self, file_name):
        # use a context manager so the file is closed after loading
        with open(file_name) as stream:
            data = ruamel.yaml.round_trip_load(stream)
        self.walk_tree(data, self._base)

    def walk_tree(self, data, prefix=None):
        """
        this is based on ruamel.yaml.scalarstring.walk_tree
        """
        if prefix is None:
            prefix = ""
        if isinstance(data, dict):
            for key in data:
                full_key = self.full_key(key, prefix)
                value = data[key]
                if isinstance(value, (dict, list)):
                    self.walk_tree(value, full_key)
                    continue
                # value is a scalar
                comment_token = data.ca.items.get(key)
                comment = comment_token[2].value if comment_token else None
                self._result.setdefault(full_key, []).append((value, comment))
        elif isinstance(data, list):
            print("don't know how to handle lists", prefix)
            sys.exit(1)

    def full_key(self, key, prefix):
        """
        check here for valid keys
        """
        if not isinstance(key, string_types):
            print('key has to be string', repr(key), prefix)
            sys.exit(1)
        if '.' in key:
            print('dot in key not allowed', repr(key), prefix)
            sys.exit(1)
        if key == '':
            print('empty key not allowed', repr(key), prefix)
            sys.exit(1)
        # avoid a leading dot when no base prefix was given
        return prefix + '.' + key if prefix else key

    def dump(self, out):
        res = CommentedMap()
        for path in self._result:
            values = self._result[path]
            if len(values) == 1: # single value for path
                res[path] = values[0][0]
                if values[0][1]:
                    res.yaml_add_eol_comment(values[0][1], key=path)
                continue
            res[path] = seq = CommentedSeq()
            for index, value in enumerate(values):
                seq.append(value[0])
                # attach each element's own comment to the sequence item
                if value[1]:
                    seq.yaml_add_eol_comment(value[1], key=index)


        ruamel.yaml.round_trip_dump(res, out)


flatten = Flatten('myproject.section.more_information')
for file_name in glob.glob('*.yaml'):
    flatten.add(file_name)
flatten.dump(sys.stdout)

If you have an additional input file:

default:
    learn_more:
        commented: value  # this value has a comment
        description: another description

then the result is:

myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description:
- description
- another description
myproject.section.more_information.default.learn_more.opens_new_window: true
myproject.section.more_information.default.learn_more.commented: value  # this value has a comment

Of course, if your input doesn't have duplicate paths, your output won't have any lists.
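The duplicate-path handling itself does not depend on ruamel.yaml. As a minimal sketch with plain, already-flattened dicts (the name merge_flat and the sample data are hypothetical, and comments are not carried over here):

```python
from collections import OrderedDict


def merge_flat(*flat_dicts):
    """Combine flattened {path: value} dicts; duplicate paths become lists."""
    collected = OrderedDict()
    for flat in flat_dicts:
        for path, value in flat.items():
            collected.setdefault(path, []).append(value)
    # Collapse single-element lists back to the bare value.
    return OrderedDict(
        (path, values[0] if len(values) == 1 else values)
        for path, values in collected.items()
    )


# Hypothetical flattened output of two input files.
first = {'default.learn_more.title': 'Title of Thing',
         'default.learn_more.description': 'description'}
second = {'default.learn_more.description': 'another description'}

for path, value in merge_flat(first, second).items():
    print(path, value)
```

Note that with this approach a value that was a list in the input cannot be told apart from a merged duplicate, which is one reason the class above refuses lists.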

Using string_types and ordereddict from ruamel.yaml makes this compatible with both Python 2 and Python 3 (you don't indicate which version you are using).

The ordereddict preserves the original key ordering, but this is of course dependent on the processing order of the files. If you want the paths sorted, just change dump() to use:

        for path in sorted(self._result):

Also note that the comment on the 'commented' dictionary entry is preserved.


¹ ruamel.yaml is a YAML 1.2 parser that preserves comments and other data on round-tripping (PyYAML supports most of YAML 1.1). Disclaimer: I am the author of ruamel.yaml.

Anthon
  • This is a very well-thought out solution to the many edge cases my input might have, thank you. I especially like that it preserves comments. – swellactually Aug 01 '16 at 16:31