I have several hundred YAML files that need to be updated from time to time.

Before update:

sss:
  - ccc:
      brr: 'mmm'
      jdk: 'openjdk8'
  - bbb:
      brr: 'rel/bbb'
      jdk: 'openjdk8'
  - aaa:
      brr: 'rel/aaa'
      jdk: 'openjdk7'

After update:

sss:
  - ddd: 
      brr: 'mmm'
      jdk: 'openjdk8'
  - ccc:
      brr: 'rel/ccc'
      jdk: 'openjdk8'
  - bbb:
      brr: 'rel/bbb'
      jdk: 'openjdk8'
  - aaa:
      brr: 'rel/aaa'
      jdk: 'openjdk7'

  1. For any occurrence of the following pattern in any file:
sss:
  - ccc:
      brr: 'mmm'
  2. Modify the above pattern to replace 'mmm' with 'rel/ccc':
  - ccc:
      brr: 'rel/ccc'
  3. Create a new multi-line string in the format:
  - new: 
      brr: 'new-mmm'
      jdk: 'openjdk8'
  4. Combine steps 2 and 3 and replace the original file with:
sss:
  - new: 
      brr: 'mmm'
      jdk: 'openjdk8'
  - ccc:
      brr: 'rel/ccc'
      jdk: 'openjdk8'

For instance, the file above needs to be updated to look like this, preserving the whitespace (spaces/tabs) on each line, since formatting is significant in YAML.

I have already tried this with PyYAML, and it does not work due to complexities in the syntax. Can this be done with awk or sed by capturing the whitespace?

askb
  • You may have a look at [SO: Shell scripting - split xml into multiple files](http://stackoverflow.com/questions/42625786/shell-scripting-split-xml-into-multiple-files/42626222#42626222). There you can see how variables are used to "switch the awk parsing" between multiple modes. – Scheff's Cat Mar 10 '17 at 06:53
  • You need to give a bit more information about the replace/add step. Adding a second-level entry that does not exist yet is easy to understand, but, for example, you place it at the beginning of the section; could it be added at the end of the same section instead? – NeronLeVelu Mar 10 '17 at 07:02
  • Can we assume that the removal of `sss:` from example 2 is a mistake? – Anthon Mar 10 '17 at 07:51
  • What do you mean by capturing the white spaces? Is the indentation level of `brr: 'mmm'` enough to identify all the positions that need to be replaced? – Anthon Mar 10 '17 at 07:56
  • @Anthon: `sss` can be removed, but it has to be used as the starting point of the search string. The indentation could differ; therefore, I would want to capture the spaces/tabs and reintroduce them in the updated multi-line string. – askb Mar 10 '17 at 09:17

2 Answers

Try something like this awk program:

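/^[ \t]*sss:/ { sss = 1 }
held != "" {
    # The previous line was "- ccc:"; this line decides what to print.
    if ($1 == "brr:" && $2 == "'mmm'") {
        print ind "- ddd:"
        print ind "    brr: 'mmm'"
        print ind "    jdk: 'openjdk8'"
        print held
        print ind "    brr: 'rel/ccc'"
        held = ""; sss = 0
        next
    }
    print held; held = ""    # no match: emit the held line unchanged
}
sss && /^[ \t]*- ccc:/ {
    # Hold the line back and record its indentation.
    held = $0; ind = substr($0, 1, index($0, "-") - 1); next
}
{ print }
END { if (held != "") print held }

The first rule marks entering the sss block. The third rule holds back the "- ccc:" line instead of printing it and records its indentation depth. The second rule runs on the line that follows: if that line is brr: 'mmm', it prints the new item, the held line and the modified value, indented according to the recorded depth, and then leaves the sss block; otherwise it prints the held line unchanged, so nothing is lost. The final rule prints the line just read. The next statements prevent the following rules from being applied to the current line. The child lines are assumed to be indented four more spaces than the list item, as in the example.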
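The same line-oriented idea can be sketched in Python, this time capturing the indentation of both the list item and its child line rather than assuming four spaces. This is only a sketch under the assumptions of the example in the question: `update`, `ITEM`, `BRR` and the `new_name` parameter are hypothetical names, `sss:` is a top-level key, and `brr: 'mmm'` is the first line after `- ccc:`.

```python
import re

# Both patterns capture the leading whitespace so it can be reused verbatim.
ITEM = re.compile(r"^(?P<ind>[ \t]*)- ccc:[ \t]*$")
BRR = re.compile(r"^(?P<ind>[ \t]*)brr:[ \t]*'mmm'[ \t]*$")


def update(text, new_name="ddd"):
    """Insert a new item before `- ccc:` / `brr: 'mmm'` inside `sss:`."""
    lines = text.splitlines()
    out = []
    in_sss = False
    i = 0
    while i < len(lines):
        line = lines[i]
        # A non-indented key starts a new top-level block.
        if line and line[0] not in " \t-#":
            in_sss = line.startswith("sss:")
        item = ITEM.match(line) if in_sss else None
        brr = BRR.match(lines[i + 1]) if item and i + 1 < len(lines) else None
        if brr:
            ind, child = item.group("ind"), brr.group("ind")
            out += [
                f"{ind}- {new_name}:",
                f"{child}brr: 'mmm'",
                f"{child}jdk: 'openjdk8'",
                line,                      # the original "- ccc:" line
                f"{child}brr: 'rel/ccc'",
            ]
            i += 2                         # both matched lines are consumed
            continue
        out.append(line)
        i += 1
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

Lines that do not match are copied through untouched, so running the function a second time on its own output should change nothing.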

Michael Vehrs

Parsing structured data, whether it is YAML, HTML, XML or CSV, with regular expressions alone only works in a tiny subset of possible cases. With YAML, handling multi-line scalars, flow style, block style, etc. in a generic way is virtually impossible. If that were not the case, someone would already have written a full YAML parser in awk. (There is nothing wrong with awk; it is just not the right tool for processing YAML.)

That doesn't mean you cannot use regular expressions to find particular elements; you just need a bit of preparation:

import sys
import re
import ruamel.yaml

yaml_str = """\
sss:
  - ccc:
      brr: 'mmm'
      jdk: 'openjdk8'
  - bbb:
      brr: 'rel/bbb'
      jdk: 'openjdk8'
  - aaa:
      brr: 'rel/aaa'
      jdk: 'openjdk7'
"""


class Paths:
    def __init__(self, data, sep=':'):
        self._sep = sep
        self._data = data

    def walk(self, data=None, prefix=None):
        if data is None:
            data = self._data
        if prefix is None:
            prefix = []
        if isinstance(data, dict):
            for idx, k in enumerate(data):
                path_list = prefix + [k]
                yield self._sep.join([str(q) for q in path_list]), path_list, idx, data[k]
                for x in self.walk(data[k], path_list):
                    yield x
        elif isinstance(data, list):
            for idx, k in enumerate(data):
                path_list = prefix + [idx]
                yield self._sep.join([str(q) for q in path_list]), path_list, idx, k
                for x in self.walk(k, path_list):
                    yield x

    def set(self, pl, val):
        pl = pl[:]
        d = self._data
        while(len(pl) > 1):
            d = d[pl.pop(0)]
        d[pl[0]] = val

    def insert_in_list(self, pl, idx, val):
        pl = pl[:]
        d = self._data
        while(len(pl) > 1):
            d = d[pl.pop(0)]
        d.insert(idx, val)


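# round_trip_load()/round_trip_dump() were deprecated and are no longer usable in
# newer ruamel.yaml releases; a YAML() instance is the round-trip equivalent.
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
paths = Paths(data)
pattern = re.compile('sss:.*:c.*:brr$')
# walk() is a generator, so take a list() snapshot before inserting/deleting
for p, pl, idx, val in list(paths.walk()):
    print('path', p)
    # only act on a matching path whose current value is 'mmm'
    if not pattern.match(p) or val != 'mmm':
        continue
    paths.set(pl, ruamel.yaml.scalarstring.SingleQuotedScalarString('rel/ccc'))
    # pl is ['sss', <list index>, 'ccc', 'brr']; insert before that list index
    # (idx is only the position of 'brr' inside its mapping)
    paths.insert_in_list(pl[:-2], pl[-3], {'new': {
        'brr': ruamel.yaml.scalarstring.SingleQuotedScalarString('mmm'),
        'jdk': ruamel.yaml.scalarstring.SingleQuotedScalarString('openjdk8')
        }})

print('----------')

yaml.dump(data, sys.stdout)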

The output for that is:

path sss
path sss:0
path sss:0:ccc
path sss:0:ccc:brr
path sss:0:ccc:jdk
path sss:1
path sss:1:bbb
path sss:1:bbb:brr
path sss:1:bbb:jdk
path sss:2
path sss:2:aaa
path sss:2:aaa:brr
path sss:2:aaa:jdk
----------
sss:
- new:
    brr: 'mmm'
    jdk: 'openjdk8'
- ccc:
    brr: 'rel/ccc'
    jdk: 'openjdk8'
- bbb:
    brr: 'rel/bbb'
    jdk: 'openjdk8'
- aaa:
    brr: 'rel/aaa'
    jdk: 'openjdk7'

  1. Printing the "paths" is not necessary; it is included to give a better idea of what is going on.

  2. The SingleQuotedScalarString is necessary to get the superfluous quotes around the string scalars in the YAML output.

  3. The dict subclass into which ruamel.yaml loads YAML mappings supports .insert(index, key, val) for Python 2.7 and Python 3.5 and later, so you can insert at specific positions of a mapping as well.
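
Since the question involves several hundred files, a small driver that applies one text transformation to every file may be useful. The following is a minimal sketch using only the standard library; `rewrite_tree`, its `transform` argument and the `*.yaml` glob are hypothetical names and assumptions, and `transform` stands for any function that takes the file's text and returns the updated text (for example, a wrapper around the load/modify/dump steps above).

```python
from pathlib import Path


def rewrite_tree(root, transform, pattern="*.yaml"):
    """Apply transform(text) -> text to every matching file under root.

    A file is rewritten only when its text actually changes.
    Returns the list of paths that were rewritten.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        # newline="" disables newline translation, so line endings are kept.
        with path.open(encoding="utf-8", newline="") as fh:
            old = fh.read()
        new = transform(old)
        if new != old:
            with path.open("w", encoding="utf-8", newline="") as fh:
                fh.write(new)
            changed.append(path)
    return changed
```

Skipping unchanged files keeps modification times and version-control diffs limited to the files that really contained the pattern.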

Anthon
  • @askb The top of the Python file had gone missing during copy-paste. I re-added that. – Anthon Mar 12 '17 at 10:22
  • I used pyaml before and got stuck on the same issue. Here is a snippet: http://paste.openstack.org/show/602443/ It does not read YAML lines starting with special characters (`!`): ruamel.yaml.constructor.ConstructorError: could not determine a constructor for the tag '!include-raw:'. Is there a workaround for this? – askb Mar 13 '17 at 03:56
  • @askb That is a whole different issue. You'll have to provide a constructor for the objects that carry tags like `!include-raw` in your file, and a representer that dumps the tag out again. Just make it a subclass of dict. But that is a new and different question, not something to address in comments. – Anthon Mar 13 '17 at 05:42
  • If the exclamation mark is in quotes, there is nothing you need to do. The error message comes from the unquoted exclamation mark in your file (i.e. the value for the key `shell`). – Anthon Mar 13 '17 at 07:56
  • It should be doable to make a generic constructor/representer, I already did something like that for pickle. I am very busy right now but I'll take a look at it tonight or later this week. – Anthon Mar 13 '17 at 10:27