
I'm using PyYAML 6.0 with Python 3.9.

In order, I am trying to...

  1. Create a YAML list
  2. Embed this list as a multi-line string in another YAML object
  3. Replace this YAML object in an existing document
  4. Write the document back, in a format that will pass YAML 1.2 linting

I have the process working, apart from the YAML 1.2 requirement, with the following code:

import yaml

def str_presenter(dumper, data):
    """configures yaml for dumping multiline strings
    Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data"""
    if data.count('\n') > 0:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)
yaml.representer.SafeRepresenter.add_representer(
    str, str_presenter) 

class DoYamlStuff:
    @staticmethod
    def post_renderers(images):
        return yaml.dump([
            {
                "op": "replace",
                "path": "/spec/postRenderers",
                "value": [
                    {
                        "kustomize": {
                            "images": images
                        }
                    }
                ]
            }])

    @classmethod
    def images_patch(cls, chart, images, ecr_url):
        return {
            "target": {
                "kind": "HelmRelease",
                "name": chart,
                "namespace": chart
            },
            "patch": cls.post_renderers([x.patch(ecr_url) for x in images])
        }

This produces something like this:

- patch: |
    - op: replace
      path: /spec/postRenderers
      value:
      - kustomize:
          images:
          - name: nginx:latest
            newName: 12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx
            newTag: latest
  target:
    kind: HelmRelease
    name: nginx
    namespace: nginx

As you can see, that's mostly working. Valid YAML, does what it needs to, etc.

Unfortunately, it doesn't indent list items by 2 spaces, so the YAML linter in our repository's pre-commit hook then re-indents everything. That makes the repo messy and causes PRs to regularly include changes that aren't relevant.

I then tried a PrettyDumper class from Stack Overflow. This reversed the effect: my indentation is now right, but the multi-line string is no longer emitted as a literal block scalar:

  - patch: "- op: replace\n  path: /spec/postRenderers\n  value:\n    - kustomize:\n\
      \        images:\n          - name: nginx:latest\n           \
      \ newName: 793961818876.dkr.ecr.eu-west-1.amazonaws.com/nginx\n        \
      \    newTag: latest\n"
    target:
      kind: HelmRelease
      name: nginx
      namespace: nginx

I have tried to merge the str_presenter function with the PrettyDumper class, but the scalars still don't work:

import yaml.emitter
import yaml.serializer
import yaml.representer
import yaml.resolver


class IndentingEmitter(yaml.emitter.Emitter):
    def increase_indent(self, flow=False, indentless=False):
        """Ensure that list items are always indented."""
        return super().increase_indent(
            flow=False,
            indentless=False,
        )


class PrettyDumper(
    IndentingEmitter,
    yaml.serializer.Serializer,
    yaml.representer.Representer,
    yaml.resolver.Resolver,
):
    def __init__(
        self,
        stream,
        default_style=None,
        default_flow_style=False,
        canonical=None,
        indent=None,
        width=None,
        allow_unicode=None,
        line_break=None,
        encoding=None,
        explicit_start=None,
        explicit_end=None,
        version=None,
        tags=None,
        sort_keys=True,
    ):
        IndentingEmitter.__init__(
            self,
            stream,
            canonical=canonical,
            indent=indent,
            width=width,
            allow_unicode=allow_unicode,
            line_break=line_break,
        )
        yaml.serializer.Serializer.__init__(
            self,
            encoding=encoding,
            explicit_start=explicit_start,
            explicit_end=explicit_end,
            version=version,
            tags=tags,
        )
        yaml.representer.Representer.__init__(
            self,
            default_style=default_style,
            default_flow_style=default_flow_style,
            sort_keys=sort_keys,
        )
        yaml.resolver.Resolver.__init__(self)
        
        yaml.add_representer(str, self.str_presenter)
        yaml.representer.SafeRepresenter.add_representer(
            str, self.str_presenter) 

    def str_presenter(self, data):
        """configures yaml for dumping multiline strings
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data"""
        if data.count('\n') > 0:  # check for multiline string
            return self.represent_scalar('tag:yaml.org,2002:str', data, style='|')
        return self.represent_scalar('tag:yaml.org,2002:str', data)

If I could merge these two approaches into the PrettyDumper class, I think it would do everything I require. Can anyone point me in the right direction?

turbonerd

1 Answer

If you need to pass your output through YAML 1.2 linting, you should not use PyYAML as it only supports (a subset of) YAML 1.1.

ruamel.yaml can handle more, e.g. using a sequence as a mapping key, something that PyYAML cannot handle at all, although it is valid YAML 1.1. Apart from that, it supports, and defaults to, YAML 1.2 loading/dumping (disclaimer: I am the author of that package).

Over the years ruamel.yaml's round-trip mode, which was originally built to preserve comments, has been extended and now also preserves superfluous quotes, anchor/alias names, the different formats of string scalars, integers and floats, etc. You can use its underlying technology to easily get what you want, without mucking with representers:

import sys
import io
import ruamel.yaml

images = [
   dict(name='nginx:latest', newName='12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx', newTag='latest'),
]
chart = 'nginx'

def data_as_literal_scalar(d):
    """dump a data structure d and make it a literal scalar string for further dumping"""
    yaml = ruamel.yaml.YAML()
    yaml.indent(sequence=4, offset=2)  # this indents even the root sequence by 2 extra positions
    buf = io.StringIO()
    yaml.dump(d, buf)
    v = ''.join([x[2:] for x in buf.getvalue().splitlines(True)])  # strip extra positions
    return ruamel.yaml.scalarstring.LiteralScalarString(v)

data = [dict(
    patch=data_as_literal_scalar([{
        "op": "replace",
        "path": "/spec/postRenderers",
        "value": [{"kustomize": {"images": images}}],
    }]),
    target={
        "kind": "HelmRelease",
        "name": chart,
        "namespace": chart,
    },
)]

yaml = ruamel.yaml.YAML()
yaml.dump(data, sys.stdout)

which gives:

- patch: |
    - op: replace
      path: /spec/postRenderers
      value:
        - kustomize:
            images:
              - name: nginx:latest
                newName: 12345678910.dkr.ecr.eu-west-1.amazonaws.com/nginx
                newTag: latest
  target:
    kind: HelmRelease
    name: nginx
    namespace: nginx
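
For comparison, if moving off PyYAML is not an option, the two pieces from the question can probably be combined by registering the string representer on the dumper subclass itself rather than on the module-level default dumper, which is a likely reason the literal style was lost. This is only a sketch, assuming PyYAML 6.0; `IndentedListDumper` is a hypothetical name, the data is shortened, and the output is still produced by a YAML 1.1 library, so it may not satisfy every rule of a YAML 1.2 linter:

```python
import yaml


class IndentedListDumper(yaml.SafeDumper):  # hypothetical name
    def increase_indent(self, flow=False, indentless=False):
        # Never emit "indentless" sequences, so list items nested in a
        # mapping are indented under their key.
        return super().increase_indent(flow, False)


def str_presenter(dumper, data):
    # Literal block style for multi-line strings, default style otherwise.
    style = "|" if "\n" in data else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


# Register on the subclass, not on the module-level default dumper.
IndentedListDumper.add_representer(str, str_presenter)

# Shortened stand-in for the patch from the question.
inner = yaml.dump(
    [{"op": "replace", "value": [{"name": "nginx"}]}],
    Dumper=IndentedListDumper,
)
outer = yaml.dump(
    [{"patch": inner, "target": {"kind": "HelmRelease"}}],
    Dumper=IndentedListDumper,
)
print(outer)
```

Every `yaml.dump` call has to pass `Dumper=IndentedListDumper`, otherwise the default dumper is used and neither customisation applies.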
Anthon
  • Thanks Anthon. I wish I had known this before I set out on this little venture :) I've previously used ruamel too; I actually have a question on SO about it from years ago! I tried to get ruamel working tonight, before I saw your comment, but I couldn't work out how to do the scalars. I will give your code a go tomorrow, and if I can get it working, happily move over. Thanks for your efforts with this package by the way. – turbonerd Mar 20 '23 at 21:15
  • Oh, one little side question if you don't mind, as I'm a Python novice - would you initialise `yaml = ruamel.yaml.YAML()` outside of a `Class`? Or as part of its `__init__` or something? – turbonerd Mar 20 '23 at 21:17
  • Working as expected. Thanks again :) – turbonerd Mar 20 '23 at 23:09
  • Please note I have two instances of `ruamel.yaml.YAML()`; only one of them has non-default indent for sequences. When I have a class that needs a `YAML()` instance I usually make a `yaml` method that is a property, that tries to return `self._yaml` and, if that fails with an AttributeError, does `self._yaml = ruamel.yaml.YAML()`, sets additional parameters like indentation and then returns `self._yaml`. This initialises `YAML()` only once, and I did some tests showing this has the least overhead time-wise when `self.yaml` is used often in the class. I don't think I have used that code in an answer yet. – Anthon Mar 21 '23 at 05:47
  • Post a real question if the comment is unclear. – Anthon Mar 21 '23 at 05:48