How can I convert a JSON file in which some field values are multi-line strings (with embedded newlines, written as "\n") to YAML, so that the values with embedded newlines, and only those values, are written using literal block notation?

For example, the following JSON:

{
   "01ea672a": {
        "summary": "A short one-line summary",
        "description": "first line\nsecond line",
        "content": "1st line\n2nd line\n"
   }
}

should generate something like the following YAML (the details may differ):

---
01ea672a:
  summary: A short one-line summary
  description: |-
    first line
    second line
  content: |
    1st line
    2nd line
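
The two headers in the expected output differ only in their chomping indicator: `|-` is used when the string has no trailing newline, and `|` when it ends with exactly one. The following is a minimal sketch of that rule, not a replacement for a YAML library. The function name `literal_block` is hypothetical, and the sketch assumes the string is non-empty and that its first line does not start with whitespace (cases where a real emitter needs extra indicators or falls back to quoting).

```python
def literal_block(value, indent=4):
    """Render a multi-line string as a YAML literal block scalar (simplified)."""
    body = value.rstrip("\n")
    trailing = len(value) - len(body)  # number of trailing newlines
    if trailing == 0:
        indicator = "|-"   # strip: no final line break
    elif trailing == 1:
        indicator = "|"    # clip: exactly one final line break
    else:
        indicator = "|+"   # keep: preserve all trailing line breaks
    pad = " " * indent
    # Indent non-empty lines only; empty lines stay empty.
    lines = [pad + line if line else "" for line in body.split("\n")]
    if trailing > 1:
        lines.extend([""] * (trailing - 1))
    return indicator + "\n" + "\n".join(lines) + "\n"


print("description: " + literal_block("first line\nsecond line"), end="")
print("content: " + literal_block("1st line\n2nd line\n"), end="")
```

The returned text is meant to follow a `key: ` prefix, with `indent` set to the key's own indentation plus the nesting step.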

I would prefer a solution in a scripting language, be it Python, Perl, Ruby or another, or one using a command-line conversion tool like Catmandu.

The online converter at json2yaml.com can do this, but I'd rather not try to use it for a 40 MB file.

Jakub Narębski

3 Answers

ruamel.yaml (disclaimer: I am the author of that library) can already round-trip your expected output without losing any of its information (including key order):

import sys
import ruamel.yaml

yaml_str = """---
01ea672a:
  summary: A short one-line summary
  description: |-
    first line
    second line
  content: |
    1st line
    2nd line
"""

yaml = ruamel.yaml.YAML()
yaml.explicit_start = True
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)

giving:

---
01ea672a:
  summary: A short one-line summary
  description: |-
    first line
    second line
  content: |
    1st line
    2nd line

If you add:

    print(type(data['01ea672a']['description']), type(data['01ea672a']))

you would see that these are LiteralScalarString from ruamel.yaml.scalarstring and CommentedMap from ruamel.yaml.comments, respectively. The latter you can create on the fly by handing the type to the JSON loader; it preserves the key order because it behaves like an ordered dict.

The former has to be "enforced" after loading, because json.loads has no 'parse_string' option (comparable to parse_float and parse_int) that would let you do so during loading. ruamel.yaml has a utility function, walk_tree, that does exactly that.
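To illustrate what such a post-load pass involves, here is a minimal standard-library sketch of an in-place traversal that swaps every string containing a newline for a marker subclass. The names `LiteralStr` and `mark_multiline` are hypothetical and are not part of ruamel.yaml; in practice walk_tree and the library's own scalar string type should be used, since only those are recognized by its dumper.

```python
import json


class LiteralStr(str):
    """Hypothetical marker type: a str that should be dumped as a literal block."""


def mark_multiline(node):
    """Replace, in place, every str value containing a newline with LiteralStr."""
    if isinstance(node, dict):
        items = list(node.items())
    elif isinstance(node, list):
        items = list(enumerate(node))
    else:
        return  # scalars at the root cannot be replaced in place
    for key, value in items:
        if isinstance(value, str):
            if "\n" in value:
                node[key] = LiteralStr(value)
        else:
            mark_multiline(value)  # recurse into nested dicts and lists


data = json.loads(
    '{"a": {"summary": "one line", "description": "first\\nsecond"},'
    ' "b": ["x\\ny", 1]}'
)
mark_multiline(data)
print(type(data["a"]["description"]).__name__)
print(type(data["a"]["summary"]).__name__)
```

Only values are inspected, not mapping keys, which matches the requirement that keys stay plain.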

With that knowledge it is trivial to make a clean transformation from JSON to YAML:

import sys
import ruamel.yaml
import json

json_str = r"""
{
    "01ea672a": {
        "summary": "A short one-line summary",
        "description": "first line\nsecond line",
        "content": "1st line\n2nd line\n"
    }
}
"""

yaml = ruamel.yaml.YAML()
yaml.explicit_start = True

data = json.loads(json_str, object_pairs_hook=ruamel.yaml.comments.CommentedMap)
ruamel.yaml.scalarstring.walk_tree(data)  

yaml.dump(data, sys.stdout)

again giving exactly the output that you expect.

obataku
Anthon

You can use the low-level event API to do that. Simply parse the JSON as YAML to get an event stream (YAML being a superset of JSON allows this) and then modify the events in the following way:

  • If it is a collection start event, make it block-style (JSON-style is called flow-style in YAML).
  • If it is a scalar key, make it plain-style.
  • If it is a scalar value, make it literal-style if the value contains a newline, and plain-style otherwise.

Finally, emit the modified events. Here's a solution with PyYaml:

import yaml
from yaml.events import *

events = []

class Level:
  def __init__(self, is_mapping):
    self.is_mapping = is_mapping
    self.is_value = True

levels = []

with open("in.json", 'r') as stream:
  for event in yaml.parse(stream):
    if len(levels) > 0 and levels[-1].is_mapping:
      levels[-1].is_value = not levels[-1].is_value
    if isinstance(event, yaml.CollectionStartEvent):
      levels.append(Level(isinstance(event, MappingStartEvent)))
      event.flow_style = False
    elif isinstance(event, CollectionEndEvent):
      levels.pop()
    elif isinstance(event, ScalarEvent):
      if len(levels) > 0 and levels[-1].is_value:
        event.style = '|' if "\n" in event.value else ''
      else:
        event.style = ''
      event.implicit = (True, True)
    events.append(event)

with open("out.yaml", 'w') as stream:
  yaml.emit(events, stream)

Note: PyYAML supports YAML 1.1, which in some edge cases is not a superset of JSON. To be safe, you may use ruamel.yaml instead, which implements YAML 1.2; I am not familiar with its code, which is why I provide a PyYAML solution.
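
The same rules can also be applied lazily. As far as I can tell, yaml.parse is a generator and yaml.emit just iterates over whatever it is given, so a generator placed between them avoids building the full event list. The following is a sketch under that assumption; the function name `restyle` and the in-memory sample document are hypothetical stand-ins for the real files.

```python
import io

import yaml
from yaml.events import (CollectionEndEvent, CollectionStartEvent,
                         MappingStartEvent, ScalarEvent)


def restyle(events):
    """Yield events one by one, restyled according to the three rules above."""
    levels = []  # stack of [is_mapping, is_value] for the open collections
    for event in events:
        if levels and levels[-1][0]:
            # Inside a mapping, events alternate between key and value.
            levels[-1][1] = not levels[-1][1]
        if isinstance(event, CollectionStartEvent):
            levels.append([isinstance(event, MappingStartEvent), True])
            event.flow_style = False  # block style instead of flow style
        elif isinstance(event, CollectionEndEvent):
            levels.pop()
        elif isinstance(event, ScalarEvent):
            is_value = bool(levels) and levels[-1][1]
            event.style = '|' if is_value and "\n" in event.value else ''
            event.implicit = (True, True)
        yield event


source = io.StringIO(
    '{"k": {"summary": "one line", "description": "first\\nsecond"}}'
)
out = io.StringIO()
yaml.emit(restyle(yaml.parse(source)), out)
result = out.getvalue()
print(result)
```

One caveat that applies to this approach in general: forcing plain style with `implicit = (True, True)` drops the quotes from every string, so a JSON string such as "123" or "true" is written unquoted and would be read back as a number or boolean.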

flyx
  • Would it be possible to emit events as they come, instead of storing whole file in memory (as events list)? – Jakub Narębski Nov 17 '17 at 19:48
  • It turns out that it cannot handle pretty-printed multi-line JSON with lines indented with TAB ("\t") character, as it was by accident in my `in.json`... not a problem with real JSON data files, which are single-line. – Jakub Narębski Nov 17 '17 at 20:29

I was able to adapt dnozay's answer to the question "Any yaml libraries in Python that support dumping of long strings as block literals or folded blocks?".

It turns out to be a bit faster than flyx's answer, though you need some additional tricks (borrowed with modification from drbild/json2yaml) to preserve the order of keys.

The main part is to register custom representers with yaml.add_representer (the code below is Python 2):

class maybe_literal_str(str): pass
class maybe_literal_unicode(unicode): pass

def change_maybe_style(representer):
    def new_maybe_representer(dumper, data):
        scalar = representer(dumper, data)
        if isinstance(data, basestring) and "\n" in data:
            scalar.style = '|'
        else:
            scalar.style = None
        return scalar
    return new_maybe_representer

import yaml
from yaml.representer import SafeRepresenter

# represent_str does handle some corner cases, so use that
# instead of calling represent_scalar directly 
represent_maybe_literal_str     = change_maybe_style(SafeRepresenter.represent_str)
represent_maybe_literal_unicode = change_maybe_style(SafeRepresenter.represent_unicode)

# I needed to use it in yaml.safe_dump() with older PyYAML,
# hence the explicit Dumper=yaml.SafeDumper
yaml.add_representer(maybe_literal_str, represent_maybe_literal_str,
                     Dumper=yaml.SafeDumper)
yaml.add_representer(maybe_literal_unicode, represent_maybe_literal_unicode,
                     Dumper=yaml.SafeDumper)

For it to work I had to wrap strings with one of those two classes:

def wrap_strings(arg):
    """Wrap {str,unicode} arguments in maybe_literal_{str,unicode}"""
    if isinstance(arg, str):
        return maybe_literal_str(arg)
    elif isinstance(arg, unicode):
        return maybe_literal_unicode(arg)
    else:
        return arg

I used this hacky function to modify the structure in place:

def transform(obj, leaf_callback):
    try:
        # is it dict or something like it?
        enum = obj.iteritems()
    except AttributeError:
        # if not dict-like, it is list-like object
        enum = enumerate(obj)
    for k, v in enum:
        # is value 'v' collection or scalar (leaf value)?
        if isinstance(v, (dict, list)):
            transform(v, leaf_callback)
        else:
            newval = leaf_callback(v)
            if newval is not None:
                obj[k] = newval

The conversion from JSON to YAML was done with:

import json

def convert_dom(json_file, yaml_file):
    loaded_json = json.load(json_file)
    transform(loaded_json, wrap_strings)
    yaml.safe_dump(loaded_json, yaml_file,
                   explicit_start=True, # start with "---\n"
                   default_flow_style=False)


with open('in.json', 'r') as json_file:
    with open('out.yaml', 'w') as yaml_file:
        # call the function defined above (it was misnamed convert_events here)
        convert_dom(json_file, yaml_file)
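
The code above relies on Python 2 names (unicode, basestring, iteritems). Below is a minimal Python 3 sketch of the same representer idea, assuming PyYAML 5.1 or later so that `sort_keys=False` is available to keep the JSON key order. The names `LiteralStr`, `wrap_multiline` and `convert` are hypothetical, and the in-memory sample stands in for the real files.

```python
import io
import json

import yaml


class LiteralStr(str):
    """Hypothetical marker type for strings to dump in literal block style."""


def represent_literal_str(dumper, data):
    # Ask for literal style; the emitter may still fall back to quoting
    # for strings it cannot write as a block scalar.
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')


yaml.add_representer(LiteralStr, represent_literal_str, Dumper=yaml.SafeDumper)


def wrap_multiline(obj):
    """Return a copy of obj with newline-containing strings wrapped."""
    if isinstance(obj, dict):
        return {key: wrap_multiline(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [wrap_multiline(value) for value in obj]
    if isinstance(obj, str) and "\n" in obj:
        return LiteralStr(obj)
    return obj


def convert(json_file, yaml_file):
    data = wrap_multiline(json.load(json_file))
    yaml.safe_dump(data, yaml_file,
                   explicit_start=True,       # start with "---"
                   default_flow_style=False,  # block style throughout
                   sort_keys=False)           # keep the JSON key order


json_file = io.StringIO(
    '{"01ea672a": {"summary": "A short one-line summary",'
    ' "description": "first line\\nsecond line",'
    ' "content": "1st line\\n2nd line\\n"}}'
)
yaml_file = io.StringIO()
convert(json_file, yaml_file)
result = yaml_file.getvalue()
print(result)
```

Unlike the in-place `transform`, this version builds a wrapped copy of the data, which is simpler but temporarily doubles memory use for large inputs.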
Jakub Narębski
  • You refer to preservation of the order of the keys as they are in JSON, but I can't figure out where in your code you actually do that. I would expect that an extra parameter to `json.load()` is needed in your final example for that. – Anthon Aug 26 '18 at 08:51
  • @Anthon: Ah, true, in my code I have instead `loaded_json = json.load(json_file, object_pairs_hook=MyOrderedDict)`, where `MyOrderedDict` is subclass of `collections.OrderedDict` "monkey-patched" so that `.items()` method returns list-like object with `.sort()` method that does nothing. This was needed because `json.load` sorts keys for some reason, and as far as I know you cannot turn this behavior off. – Jakub Narębski Aug 26 '18 at 13:05
  • `yaml.(safe_)dump()` sorts the keys of dict before dumping the dict. I know there is code on [so] that prevents that from happening by changing the representer for dicts, but your "patch" of the `.sort()` method works as well, I had never thought about doing it that way. – Anthon Aug 26 '18 at 13:45