59

I've got an object with a short string attribute, and a long multi-line string attribute. I want to write the short string as a YAML quoted scalar, and the multi-line string as a literal scalar:

my_obj.short = "Hello"
my_obj.long = "Line1\nLine2\nLine3"

I'd like the YAML to look like this:

short: "Hello"
long: |
  Line1
  Line2
  Line3

How can I instruct PyYAML to do this? If I call yaml.dump(my_obj), it produces a dict-like output:

{long: 'line1

    line2

    line3

    ', short: Hello}

(Not sure why long is double-spaced like that...)

Can I dictate to PyYAML how to treat my attributes? I'd like to affect both the order and style.

Anthon
Ned Batchelder

6 Answers

71

Falling in love with @lbt's approach, I got this code:

import yaml

def str_presenter(dumper, data):
  if len(data.splitlines()) > 1:  # check for multiline string
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
  return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)

# to use with safe_dump:
yaml.representer.SafeRepresenter.add_representer(str, str_presenter)

It makes every multiline string be a block literal.

I was trying to avoid the monkey patching part. Full credit to @lbt and @J.F.Sebastian.
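
A related variation, if you would rather not touch PyYAML's module-level dumpers at all: register the presenter on your own Dumper subclass and pass that to `yaml.dump`. This is only a minimal sketch, assuming PyYAML 5.1 or later (for `sort_keys`); the class name `BlockStyleDumper` is made up for illustration.

```python
import yaml


class BlockStyleDumper(yaml.SafeDumper):
    """Hypothetical Dumper subclass; representers added here stay local to it."""


def str_presenter(dumper, data):
    # Multiline strings become block literals; everything else keeps the default style.
    if len(data.splitlines()) > 1:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)


# add_representer is a classmethod, so this registers on BlockStyleDumper only.
BlockStyleDumper.add_representer(str, str_presenter)

data = {'short': 'Hello', 'long': 'Line1\nLine2\nLine3\n'}
block_output = yaml.dump(data, Dumper=BlockStyleDumper,
                         default_flow_style=False, sort_keys=False)
print(block_output)

# The stock safe dumper was not modified and still quotes the multiline string.
plain_output = yaml.safe_dump(data, default_flow_style=False)
print(plain_output)
```

Only calls that pass `Dumper=BlockStyleDumper` get the block-literal behaviour.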

jwatson0
xenosoz
  • nice approach that allows to avoid tagging input strings explicitly. You could use `is_multiline = lambda s: len(s.splitlines()) > 1` that recognizes Unicode newlines automatically and it doesn't return true for a single line. – jfs Nov 26 '15 at 17:49
  • @J.F.Sebastian It's nice to see that nice trick. Now the code looks much better. Many thanks! – xenosoz Nov 30 '15 at 23:22
  • hmm, the `style='|'` doesn't seem to affect pyyaml – Jason S Oct 13 '17 at 20:41
  • @jfs Note that `splitlines` ignores final newline characters, so if you want single-line strings with a final newline to be processed with `|` style, you have to check for that separately. Or if `\n` is the only type of linebreak in your dataset, simply use `len(data.split('\n')) > 1` – oulenz Jan 16 '18 at 10:26
  • @oulenz: "multiline" means "more than one" string. `.split('\n')` is wrong here. `'\n'` is *one* line, not two. – jfs Jan 16 '18 at 10:51
  • @jfs Yes but the pyyaml emitter will print `test\n` on two lines surrounded by single quotes, so if the goal of this exercise is to use block style rather than quotes if a string contains newlines, then we have to handle this case as if it were multiline (whether that's technically correct or not). – oulenz Jan 16 '18 at 11:02
  • Instead of `splitlines`, simply testing for `if '\n' in data` is cheaper and does the same thing. – Matthias Urlichs Aug 11 '19 at 10:43
  • @MatthiasUrlichs what if `data` is `Often, the sequence to represent a new line is \`\n\`, but may include \`\r\` as well\0` – pyansharp Oct 07 '20 at 18:49
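
The two multiline checks discussed in these comments disagree on a couple of edge cases. The following plain-Python comparison (no PyYAML involved; the helper names `by_splitlines` and `by_newline` are made up for illustration) shows where:

```python
def by_splitlines(s):
    # Counts logical lines; recognises Unicode line separators,
    # but ignores a single trailing newline.
    return len(s.splitlines()) > 1


def by_newline(s):
    # Only looks for '\n'; a trailing newline counts, other separators do not.
    return '\n' in s


samples = {
    'single line': 'test',
    'single line with final newline': 'test\n',
    'two lines': 'a\nb',
    'unicode line separator': 'a\u2028b',
}

results = {label: (by_splitlines(text), by_newline(text))
           for label, text in samples.items()}

for label, (split_says, newline_says) in results.items():
    print(f'{label}: splitlines={split_says}, newline={newline_says}')
```

The checks differ for `'test\n'` (only the `'\n' in data` test fires) and for `'a\u2028b'` (only the `splitlines` test fires), so which one is right depends on the data you expect.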
38

Based on "Any yaml libraries in Python that support dumping of long strings as block literals or folded blocks?":

import yaml
from collections import OrderedDict

class quoted(str):
    pass

def quoted_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
yaml.add_representer(quoted, quoted_presenter)

class literal(str):
    pass

def literal_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
yaml.add_representer(literal, literal_presenter)

def ordered_dict_presenter(dumper, data):
    return dumper.represent_dict(data.items())
yaml.add_representer(OrderedDict, ordered_dict_presenter)

d = OrderedDict(short=quoted("Hello"), long=literal("Line1\nLine2\nLine3\n"))

print(yaml.dump(d))

Output

short: "Hello"
long: |
  Line1
  Line2
  Line3
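
The question starts from an object rather than a dict, so here is a rough sketch of the same idea applied to a class. It assumes a hypothetical `MyObj` class and a hypothetical `ObjDumper` subclass (neither is from the question), and registers everything on that subclass instead of globally. Passing a list of pairs to `represent_mapping` keeps the keys in list order.

```python
import yaml


class quoted(str):
    pass


class literal(str):
    pass


class MyObj:
    """Stand-in for the question's object (hypothetical class)."""

    def __init__(self, short, long):
        self.short = short
        self.long = long


class ObjDumper(yaml.SafeDumper):
    """Hypothetical Dumper subclass so the global dumpers stay untouched."""


def quoted_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')


def literal_presenter(dumper, data):
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')


def my_obj_presenter(dumper, obj):
    # A list of (key, value) pairs is emitted in list order, and wrapping
    # each value selects its scalar style.
    pairs = [('short', quoted(obj.short)), ('long', literal(obj.long))]
    return dumper.represent_mapping('tag:yaml.org,2002:map', pairs)


ObjDumper.add_representer(quoted, quoted_presenter)
ObjDumper.add_representer(literal, literal_presenter)
ObjDumper.add_representer(MyObj, my_obj_presenter)

my_obj = MyObj('Hello', 'Line1\nLine2\nLine3\n')
obj_yaml = yaml.dump(my_obj, Dumper=ObjDumper, default_flow_style=False)
print(obj_yaml)
```

Note that the object is dumped as a plain mapping, so loading the YAML back gives a dict, not a `MyObj`.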
Vasili Syrakis
jfs
  • any way to do this so that it does not affect the global yaml state but it does affect one individual call to `dump()`? – Jason S Oct 13 '17 at 20:25
  • @JasonS: it is a good separate question. You could try passing your own Dumper class to `yaml.dump` with overridden represent_scalar, represent_dict methods. – jfs Oct 14 '17 at 05:37
  • For some reason this doesn't work in this instance: https://gist.github.com/retorquere/c43b5394c5e45b4c5b54b46479725e3c . Any ideas? – retorquere May 09 '19 at 16:30
  • This is a great answer and it works for me so thanks! ... but I would like to add that this works with `yaml.dump()` but doesn't work with `yaml.safe_dump()`. Unless if there's a way to get this to work with `yaml.safe_dump()` that I missed. – cbautista Dec 19 '19 at 18:56
  • To use with safe_dump(): `yaml.representer.SafeRepresenter.add_representer(OrderedDict, ordered_dict_presenter)` – jwatson0 Aug 26 '21 at 18:27
  • PyYAML will silently fall back to quoted style if your string has whitespace before any newlines or contains tab characters. This makes sense because the YAML spec can't represent these situations in block literals. To get this working I had to use something like `re.sub("\s+$","",my_string,flags=re.MULTILINE).replace("\t"," ")` – M Virts Oct 11 '22 at 21:24
15

I wanted any input with a \n in it to be a block literal. Using the code in yaml/representer.py as a base I got:

# -*- coding: utf-8 -*-
import yaml

def should_use_block(value):
    for c in u"\u000a\u000d\u001c\u001d\u001e\u0085\u2028\u2029":
        if c in value:
            return True
    return False

def my_represent_scalar(self, tag, value, style=None):
    if style is None:
        if should_use_block(value):
            style = '|'
        else:
            style = self.default_style

    node = yaml.representer.ScalarNode(tag, value, style=style)
    if self.alias_key is not None:
        self.represented_objects[self.alias_key] = node
    return node


a={'short': "Hello", 'multiline': """Line1
Line2
Line3
""", 'multiline-unicode': u"""Lêne1
Lêne2
Lêne3
"""}

print(yaml.dump(a))
print(yaml.dump(a, allow_unicode=True))
yaml.representer.BaseRepresenter.represent_scalar = my_represent_scalar
print(yaml.dump(a))
print(yaml.dump(a, allow_unicode=True))

Output

{multiline: 'Line1

    Line2

    Line3

    ', multiline-unicode: "L\xEAne1\nL\xEAne2\nL\xEAne3\n", short: Hello}

{multiline: 'Line1

    Line2

    Line3

    ', multiline-unicode: 'Lêne1

    Lêne2

    Lêne3

    ', short: Hello}

After override

multiline: |
  Line1
  Line2
  Line3
multiline-unicode: "L\xEAne1\nL\xEAne2\nL\xEAne3\n"
short: Hello

multiline: |
  Line1
  Line2
  Line3
multiline-unicode: |
  Lêne1
  Lêne2
  Lêne3
short: Hello
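
The same effect can probably be had without replacing `BaseRepresenter.represent_scalar` for everyone, by overriding the method in a Dumper subclass. This is a sketch under that assumption, not the code from above; `BlockScalarDumper` and `LINE_BREAKS` are names made up for illustration.

```python
import yaml

# Same line-break characters as in should_use_block above.
LINE_BREAKS = "\n\r\x1c\x1d\x1e\x85\u2028\u2029"


class BlockScalarDumper(yaml.SafeDumper):
    """Hypothetical Dumper subclass; the override is local to this class."""

    def represent_scalar(self, tag, value, style=None):
        # Only pick a style when the caller did not request one explicitly.
        if style is None and any(c in value for c in LINE_BREAKS):
            style = '|'
        return super().represent_scalar(tag, value, style=style)


a = {'short': 'Hello', 'multiline': 'Line1\nLine2\nLine3\n'}

scoped_output = yaml.dump(a, Dumper=BlockScalarDumper, default_flow_style=False)
print(scoped_output)

# Calls that do not pass the subclass keep PyYAML's default behaviour.
global_output = yaml.dump(a, default_flow_style=False)
print(global_output)
```

As with the monkey-patched version, PyYAML can still fall back to a quoted style when the string cannot be written as a block scalar.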
lbt
8

You can use ruamel.yaml and its RoundTripLoader/Dumper (disclaimer: I am the author of that package). Apart from doing what you want, it supports the YAML 1.2 specification (from 2009) and has several other improvements:

import sys
from ruamel.yaml import YAML

yaml_str = """\
short: "Hello"  # does keep the quotes, but need to tell the loader
long: |
  Line1
  Line2
  Line3
folded: >
  some like
  explicit folding
  of scalars
  for readability
"""

yaml = YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)

gives:

short: "Hello"  # does keep the quotes, but need to tell the loader
long: |
  Line1
  Line2
  Line3
folded: >
  some like
  explicit folding
  of scalars
  for readability

(including the comment, starting in the same column as before)

You can also create this output starting from scratch, but then you do need to provide the extra information, e.g. the explicit positions at which to fold.

Anthon
  • "provide the extra information" this looks promising. Do you have any clue about this? I can't find it in the documentation – abstrus May 10 '23 at 16:31
  • You need to provide the positions in the string that the folds need to take place. There is no documentation for that, as there is no API for that, but you can simply do the same thing the constructor of the FoldedScalarString does. – Anthon May 10 '23 at 20:37
  • @Anthon how can we store the dump to a variable? – Nikhil VJ Aug 29 '23 at 02:35
  • got a solution using StringIO: `output_stream = StringIO()` `yaml.dump(data, output_stream)` `yaml_output = output_stream.getvalue()` – Nikhil VJ Aug 29 '23 at 02:41
  • @NikhilVJ You should look at the package `ruamel.yaml.string` (or `ruamel.yaml.bytes`) – Anthon Aug 29 '23 at 06:41
5

It's worth noting that pyyaml disallows trailing spaces in block scalars and will force content into double-quoted format. It seems a lot of folk have run into this issue. If you don't care about being able to round-trip the data, this will strip out those trailing spaces:

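import yaml

def str_presenter(dumper, data):
    # Treat anything containing a line break as multiline.
    if len(data.splitlines()) > 1 or '\n' in data:
        # Strip trailing whitespace from each line; PyYAML refuses block style otherwise.
        # This is lossy, and the join also drops a final newline.
        text_list = [line.rstrip() for line in data.splitlines()]
        fixed_data = "\n".join(text_list)
        return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)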
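
To see the fallback in action, here is a small sketch that dumps the same text before and after cleaning. It is a variation on the presenter above, not a replacement: `LiteralDumper` and `strip_trailing_spaces` are names made up for illustration, and this helper keeps the final newline so the block comes out as `|` rather than `|-`.

```python
import yaml


class LiteralDumper(yaml.SafeDumper):
    """Hypothetical Dumper subclass used only for this demonstration."""


def literal_presenter(dumper, data):
    # Ask for block style whenever the string contains a newline.
    style = '|' if '\n' in data else None
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style=style)


LiteralDumper.add_representer(str, literal_presenter)


def strip_trailing_spaces(text):
    # rstrip each line but keep a final newline if there was one.
    # This is lossy: the trailing whitespace cannot be recovered.
    cleaned = '\n'.join(line.rstrip() for line in text.splitlines())
    return cleaned + '\n' if text.endswith('\n') else cleaned


raw = 'Line1  \nLine2\n'  # note the two spaces before the first newline

raw_output = yaml.dump({'text': raw}, Dumper=LiteralDumper,
                       default_flow_style=False)
clean_output = yaml.dump({'text': strip_trailing_spaces(raw)},
                         Dumper=LiteralDumper, default_flow_style=False)

print(raw_output)    # block style was requested, but PyYAML falls back to quotes
print(clean_output)  # emitted as a block literal
```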
brian
0

Using ruamel.yaml, as suggested in Anthon's answer, here are simple functions to convert YAML text to a dict and back that you can conveniently keep among your utility functions:

from ruamel.yaml import YAML
from io import StringIO

def yaml2dict(y):
    return YAML().load(y)

def dict2yaml(d):
    output_stream = StringIO()
    YAML().dump(d, output_stream)
    return output_stream.getvalue()

Sample multi-line yaml to dict:

y = """
title: organelles absent in animal cells and present in a plant cell
question: |
  Observe the following table and identify if the cell is of a plant or an animal
  | Organelle | Present/Absent | 
  |---------- | -------------- | 
  | Nucleus | Present |
  | Vacuole | Present |
  | Cellwall | Absent |
  | Cell membrane | Present |
  | Mitochondria | Present |
  | Chlorophyll | Absent |
answer_type: MCQ_single
choices:
- Plant
- Animal
points: 1
"""
d = yaml2dict(y)
d

output:

{'title': 'organelles absent in animal cells and present in a plant cell', 'question': 'Observe the following table and identify if the cell is of a plant or an animal\n| Organelle | Present/Absent | \n|---------- | -------------- | \n| Nucleus | Present |\n| Vacuole | Present |\n| Cellwall | Absent |\n| Cell membrane | Present |\n| Mitochondria | Present |\n| Chlorophyll | Absent |\n', 'answer_type': 'MCQ_single', 'choices': ['Plant', 'Animal'], 'points': 1}

Converting it back to yaml:

y2 = dict2yaml(d)
print(y2)

Output:

title: organelles absent in animal cells and present in a plant cell
question: |
  Observe the following table and identify if the cell is of a plant or an animal
  | Organelle | Present/Absent | 
  |---------- | -------------- | 
  | Nucleus | Present |
  | Vacuole | Present |
  | Cellwall | Absent |
  | Cell membrane | Present |
  | Mitochondria | Present |
  | Chlorophyll | Absent |
answer_type: MCQ_single
choices:
- Plant
- Animal
points: 1

For completeness, install ruamel.yaml with:

pip install ruamel.yaml
Nikhil VJ