How to replace many identical values in a YAML file

Question

I am currently building a Python application that uses YAML configs. I generate the YAML config file from other YAML files. I have a "template" YAML, which defines the basic structure I want in the YAML file the app uses, and then many different "data" YAMLs that fill in the template to spin the application's behavior a certain way. For example, say I had 10 "data" YAMLs. Depending on where the app is being deployed, 1 "data" YAML is chosen and used to fill out the "template" YAML. The resulting filled-out YAML is what the application uses to run. This saves me a ton of work. I have run into a problem with this method though. Say I have a template YAML that looks like this:

id: {{id}}
endpoints:
  url1: https://website.com/{{id}}/search
  url2: https://website.com/foo/{{id}}/get_thing
  url3: https://website.com/hello/world/{{id}}/trigger_stuff
foo:
  bar:
    deeply:
      nested: {{id}}

Then somewhere else, I have like 10 "data" YAMLs, each with a different value for {{id}}. I can't seem to figure out an efficient way to replace all these {{id}} occurrences in the template. I am having a problem because sometimes the value to be substituted is a substring of a value I want to mostly keep, or the occurrences are very far apart from each other in the hierarchy, making looping solutions inefficient. My current method for generating the config file using template+data looks something like this in Python:

import yaml
import os

template_yaml = os.path.abspath(os.path.join(os.path.dirname(__file__), 'template.yaml'))
# In this same folder you would find flavor2, flavor3, flavor4, etc, lets just use 1 for now
data_yaml = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data_files', 'flavor1.yaml'))
# This is where we dump the filled out template the app will actually use
output_directory = os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))

with open(template_yaml, 'r') as template:
    try:
        loaded_template = yaml.load(template)  # Load the template as a dict
        with open(data_yaml, 'r') as data:
            loaded_data = yaml.load(data)  # Load the data as a dict
        # From this point on I am basically just setting individual keys from "loaded_template" to values in "loaded_data"
        # But 1 at a time, which is what I am trying to avoid:
        loaded_template['id'] = loaded_data['id']
        # str.format() would turn '{{id}}' into a literal '{id}', so use replace() for the substring case
        loaded_template['endpoints']['url1'] = loaded_template['endpoints']['url1'].replace('{{id}}', str(loaded_data['id']))
        loaded_template['foo']['bar']['deeply']['nested'] = loaded_data['id']
    except yaml.YAMLError as exc:
        # The original snippet was cut off before its except clause; this one only reports the parse error
        print(exc)

Any idea on how to go through and change all the {{id}} occurrences faster?

For a single yaml file, is the `id` that you want to replace always the same? — willk, Nov 29 '18 at 16:49
Yes. Every {{id}} occurrence is replaced by the same value X — Rosey, Nov 29 '18 at 17:20
Why are you using `.load()` and not `.safe_load()`? The former is documented to be potentially unsafe, and the latter is sufficient. — Anthon, Nov 29 '18 at 18:23

score 4 · Accepted Answer · answered Nov 29 '18 at 19:30

You are proposing to use PyYAML, but it is not well suited for updating YAML files. In that process, if it can load your file in the first place, you lose your mapping key order and any comments you have in the file, merges get expanded, and any special anchor names get lost in translation. Apart from that, PyYAML cannot deal with the latest YAML spec (released 9 years ago), and it can only handle simple mapping keys.

There are two main solutions:

You can use substitution on the raw file
You can use ruamel.yaml and recurse into the data structure

Substitution

If you use substitution, you can do it in a much more efficient way than the line-by-line substitution that @caseWestern proposes. But most of all, you should harden the scalars in which these substitutions take place. Currently you have plain scalars (i.e. flow style scalars without quotes), and those tend to break if you insert things like #, : and other syntactically significant elements.

To prevent that from happening, rewrite your input file to use block style literal scalars:

id: {{id}}
endpoints:
  url1: |-
    https://website.com/{{id}}/search
  url2: |-
    https://website.com/foo/{{id}}/get_thing
  url3: |-
    https://website.com/hello/world/{{id}}/trigger_stuff
foo:
  bar:
    deeply:
      nested: |-
        {{id}}

If the above is in alt.yaml you can do:

val = 'xyz'

with open('alt.yaml') as ifp:
    with open('new.yaml', 'w') as ofp:
        ofp.write(ifp.read().replace('{{id}}', val))

to get:

id: xyz
endpoints:
  url1: |-
    https://website.com/xyz/search
  url2: |-
    https://website.com/foo/xyz/get_thing
  url3: |-
    https://website.com/hello/world/xyz/trigger_stuff
foo:
  bar:
    deeply:
      nested: |-
        xyz
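If a template has several different placeholders, the same raw-text approach can still be done in a single pass. The following is only a minimal sketch, assuming every placeholder has the form {{name}} with a word-like name; `fill_template` and `PLACEHOLDER` are hypothetical names, not part of any library.

```python
import re

# Matches placeholders such as {{id}} or {{region}}; the name is captured.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(text, values):
    """Replace every {{name}} in text with values[name] in one pass."""

    def lookup(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly instead of leaving a placeholder in the output
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(values[name])

    return PLACEHOLDER.sub(lookup, text)


template = (
    "id: {{id}}\n"
    "endpoints:\n"
    "  url1: |-\n"
    "    https://website.com/{{id}}/search\n"
)
filled = fill_template(template, {"id": "xyz"})
print(filled)
```

Note that a replacement value containing a line break would still need to be indented to fit inside a literal block scalar; this sketch does not handle that.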

ruamel.yaml

Using ruamel.yaml (disclaimer: I am the author of that package), you don't have to worry about breaking the input with syntactically significant replacement text: the output is automatically quoted correctly where needed. You do have to take care that your input is valid YAML. Something like {{ at the beginning of a node indicates two nested flow-style mappings, so you'll run into trouble with it.

The big advantage here is that your input file is loaded, and it is checked to be correct YAML. But this is significantly slower than file level substitution.

So if your input is in.yaml:

id: <<id>>  # has to be unique
endpoints: &EP
  url1: https://website.com/<<id>>/search
  url2: https://website.com/foo/<<id>>/get_thing
  url3: https://website.com/hello/world/<<id>>/trigger_stuff
foo:
  bar:
    deeply:
      nested: <<id>>
    endpoints: *EP
    [octal, hex]: 0o123, 0x1F

You can do:

import sys
import ruamel.yaml

def recurse(d, pat, rep):
    if isinstance(d, dict):
        for k in d:
            if isinstance(d[k], str):
                d[k] = d[k].replace(pat, rep)
            else:
                recurse(d[k], pat, rep)
    if isinstance(d, list):
        for idx, elem in enumerate(d):
            if isinstance(elem, str):
                d[idx] = elem.replace(pat, rep)
            else:
                recurse(d[idx], pat, rep)


yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
with open('in.yaml') as fp:
    data = yaml.load(fp)
recurse(data, '<<id>>', 'xy: z')  # not that this makes much sense, but it proves a point
yaml.dump(data, sys.stdout)

which gives:

id: 'xy: z' # has to be unique
endpoints: &EP
  url1: 'https://website.com/xy: z/search'
  url2: 'https://website.com/foo/xy: z/get_thing'
  url3: 'https://website.com/hello/world/xy: z/trigger_stuff'
foo:
  bar:
    deeply:
      nested: 'xy: z'
    endpoints: *EP
    [octal, hex]: 0o123, 0x1F

Please note:

The values that contain the replacement pattern are automatically quoted on dump, to deal with the : + space that would otherwise indicate a mapping and break the YAML.
The YAML.load() method, contrary to PyYAML's load function, is safe (i.e. it cannot execute arbitrary Python by manipulating the input file).
The comment, the octal and hexadecimal integers, and the alias name are preserved.
PyYAML cannot load the file in.yaml at all, although it is valid YAML.
The above recurse only changes the values of the input mapping. If you want to do the keys as well, you either have to pop and reinsert all the keys (even those not changed) to keep the original order, or you need to use enumerate and d.insert(position, key, value). If you have merges, you also cannot just walk over the keys; you'll have to walk over the non-merged keys of the "dict".
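The "pop and reinsert all the keys" idea from the last note can be illustrated on a plain Python dict. This is only a rough sketch under assumptions: `replace_in_keys` is a hypothetical helper, it handles one mapping level, it ignores merges and key collisions, and whether comments attached to keys survive on a ruamel.yaml mapping is not something this sketch verifies.

```python
def replace_in_keys(d, pat, rep):
    """Rename string keys containing pat, keeping the original key order."""
    for key in list(d):  # snapshot of the keys, since the dict changes while we loop
        value = d.pop(key)
        new_key = key.replace(pat, rep) if isinstance(key, str) else key
        d[new_key] = value  # reinserting every key in turn preserves the order


config = {"<<id>>_url": "a", "other": "b", "<<id>>_name": "c"}
replace_in_keys(config, "<<id>>", "xyz")
print(list(config))
```

Every key is popped and reinserted, including unchanged ones, because reinserting only the renamed keys would move them to the end of the mapping.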

Idk who downvoted this. It's a completely different approach but still seeks to address my core issue. Thanks for taking the time to write this, Anthon. I like some of the features you outlined, but I have a concern with the proposed "recurse" function. This function takes a pattern to recognize and replace. So I would have to call replace for every different set of `{{things}}`, yeah? Each call of recurse seems to scan the entire file. I appreciate the safety concern and hesitance to run arbitrary Python functions, but in my posted solution, using Python functions seems like less work than this, no? — Rosey, Nov 30 '18 at 14:10
@rosey The recurse walks over the loaded data, not over the file, and does so once. But in case you want to do multiple substitutions, you should change the function to take a dict as parameter (instead of `pat`, `rep`) and then instead of the single replace repeatedly try to replace each key in that dict with its value. That way you still walk over the structure only once, and do all of the replacements. Let me know if you want my answer updated that way. (About the downvote: I do regularly comment on bad questions or incorrect answers, so I probably stepped on someone's toes) — Anthon, Nov 30 '18 at 14:48
Yeah actually I just did that. Lightning fast, I was worried about nothing. I will be using this method and library for my YAML from now on thanks! — Rosey, Nov 30 '18 at 15:12
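The dict-based variant described in the comments above could look roughly like the following. It is a sketch, not the answerer's actual update: `recurse_many` is a hypothetical name, and the example runs on plain dicts and lists rather than on data loaded by ruamel.yaml.

```python
def recurse_many(node, replacements):
    """Walk nested dicts and lists once, applying all replacements to string values."""

    def substitute(text):
        # Try each pattern in turn on the same string
        for pattern, value in replacements.items():
            text = text.replace(pattern, value)
        return text

    if isinstance(node, dict):
        for key in node:
            if isinstance(node[key], str):
                node[key] = substitute(node[key])
            else:
                recurse_many(node[key], replacements)
    elif isinstance(node, list):
        for idx, elem in enumerate(node):
            if isinstance(elem, str):
                node[idx] = substitute(elem)
            else:
                recurse_many(elem, replacements)


data = {
    "id": "<<id>>",
    "endpoints": {"url1": "https://website.com/<<id>>/search"},
    "regions": ["<<region>>-a", 42],
}
recurse_many(data, {"<<id>>": "xyz", "<<region>>": "eu"})
print(data)
```

The structure is walked once no matter how many patterns the dict holds; only the per-string work grows with the number of patterns.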

score 2 · willk · Answer 2 · 2018-11-29T17:38:20.973

If the id is the same in every location for a single yaml file, then you could just read in the template as plain text and use string replacement line by line.

new_file = []

# New id for replacement (from loaded file)
id_ = '123'

# Open template file 
with open('template.yaml', 'r') as f:
    # Iterate through each line
    for l in f:
        # Replace every {{id}} occurrence
        new_file.append(l.replace('{{id}}', id_))

# Save the new file
with open('new_file.yaml', 'w') as f:
    for l in new_file:
        f.write(l)

This will replace {{id}} with the same id_ everywhere in the file and will not change any of the formatting.

edited Nov 29 '18 at 17:38

answered Nov 29 '18 at 17:09

willk


I thought about doing something like this. In your example, isn't that for loop just searching the top level of the loaded dict though? That was why I decided against this approach. What do you do for the deeply nested {{id}}? – Rosey Nov 29 '18 at 17:23
caseWestern's method ignores the fact that this is a YAML file. He's reading the file line by line as plain text. – axblount Nov 29 '18 at 17:26
OHHHHHH. Is that safe? That seems like it would solve my problem if that's safe. – Rosey Nov 29 '18 at 17:27
The only thing this does is replace `{{id}}` with the `id_`. It shouldn't change any of the formatting. Nonetheless, you probably want to test it on a few files at first. It seemed like the simplest solution! – willk Nov 29 '18 at 17:41
@Rosey If you load your YAML with PyYAML's `safe_load()` function, then you don't have to worry about the substitution being safe. You can of course easily **break** the YAML with the replacement, because you are using plain scalars, which are rather fragile (e.g. when the replacement contains space + `#` or `:` + space). If you were using literal block style scalars you would not have that problem – Anthon Nov 29 '18 at 18:28
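The fragility of plain scalars mentioned in the last comment can be guarded against with a quick check before doing a raw text replacement. This is a rough, non-exhaustive sketch covering only the cases named in the thread plus line breaks; `risky_in_plain_scalar` is a hypothetical helper and not a substitute for loading the result with a YAML parser.

```python
def risky_in_plain_scalar(value):
    """Rough, non-exhaustive check for text that can change how a plain scalar parses."""
    return (
        ": " in value      # colon + space would start a mapping value
        or " #" in value   # space + hash would start a comment
        or "\n" in value   # a line break can end or re-indent the scalar
    )


for candidate in ("123", "xy: z", "abc #1"):
    print(candidate, risky_in_plain_scalar(candidate))
```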

score 1 · Rosey · Answer 3 · 2018-11-29T18:17:50.177

YAML has built-in "anchors" that you can define and reference, kind of like variables. It wasn't obvious to me that these are actually substituting their values where referenced, because you only see the result AFTER you parse a YAML. Code is shamelessly stolen from a Reddit post covering a similar topic:

# example.yaml
params: &params
  PARAM1: &P1 5
  PARAM2: &P2 "five"
  PARAM3: &P3 [*P1, *P2]

data:
  <<: *params
  more:
    - *P3
    - *P2


# yaml.load(example) =>
{
'params': {
    'PARAM1': 5, 
    'PARAM2': 'five', 
    'PARAM3': [5, 'five']
},
'data': {
    'PARAM1': 5,
    'PARAM2': 'five',
    'PARAM3': [5, 'five'],
    'more': [[5, 'five'], 'five']
}
}

And this post here on SO is how I think you can use anchors as a substring (assuming you are using python)

edited Nov 29 '18 at 18:17

answered Nov 29 '18 at 17:51

Rosey


I am not sure why you think you can use anchors as substrings; you don't give an example and you'll be hard pressed to find one. You cannot use aliases as a substring of a scalar, but you can have some tagged object that combines its subnodes, some or all of which are alias nodes to previously defined scalars. This is entirely doable, but not very readable. – Anthon Nov 29 '18 at 18:32
Using YAML alone, you're right, you can't use anchors as a substring. If you check the SO post I linked, the 2nd answer shows how you can if you leverage some Python-specific markers + functions. In practice it looks something like this: `substring: !join [*P1, 333333333]`. The HUGE caveat here is that !join runs a Python function when parsed – Rosey Nov 30 '18 at 13:52
I had not checked that link, sorry. But that `!join` is a tagged object, which is what I referred to in my (superfluous) comment. It is (IMO) just not very readable to do `!join [https://website.com/, *id, /search]` instead of `https://website.com/<<id>>/search`. And apart from that, you need to specify your anchored value (`&id`) in the file, and updating that by loading/dumping the YAML document is non-trivial with ruamel.yaml and IIRC impossible with PyYAML. – Anthon Nov 30 '18 at 14:42
