3

I am looking for a way to parse a YAML file, change each string, and then save the file without changing the structure of the original. In my opinion I should not use regex for this but some kind of YAML parser. Sample YAML input below:

receipt:     Oz-Ware Purchase Invoice
date:        2007-08-06
customer:
    given:   Dorothy

items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)

    - part_no:   E1628
      descrip:   High Heeled "Ruby" Slippers
      size:      8

bill-to:  &id001
    street: |
            123 Tornado Alley
            Suite 16
    city:   East Centerville
    state:  KS

ship-to:  *id001

specialDelivery:  >
    Follow the Yellow Brick
    Road to the Emerald City.
...

Desired output:

receipt:     ###Oz-Ware Purchase Invoice###
date:        ###2007-08-06###
customer:
    given:   ###Dorothy###

items:
    - part_no:   ###A4786###
      descrip:   ###Water Bucket (Filled)###

    - part_no:   ###E1628###
      descrip:   ###High Heeled "Ruby" Slippers###
      size:      ###8###

bill-to:  ###&id001###
    street: |
            ###123 Tornado Alley
            Suite 16###
    city:   ###East Centerville###
    state:  ###KS###

ship-to:  ###*id001###

specialDelivery:  >
    ###Follow the Yellow Brick
    Road to the Emerald City.###
...

Is there a good YAML parser that can handle complicated YAML files, change strings, and save the data back without affecting the structure of the document? Maybe you have another idea for solving this problem. Basically, I would like to iterate through each string from the top of the document and modify it. Any hints appreciated.

wariacik
  • http://www.codeproject.com/Articles/28720/YAML-Parser-in-C – Dreamweaver May 27 '15 at 21:06
  • 1
    Have you tried YamlDotNet? It seems to provide what you need. https://github.com/aaubry/YamlDotNet – kspearrin May 27 '15 at 21:10
  • 1
    Most YAML parsers will discard the extra spaces before the values and lose all of the implicit alignment information. The parser I know will also interpret the anchor and reference on reading in (and create references to the same data). I can show you how to do most of that in Python (the folded style scalars are a problem), if that is an option, but since this is marked C# I won't unless you confirm that is OK. – Anthon May 28 '15 at 15:28
  • @Dreamweaver Thanks a lot for the suggestion, but I couldn't find any sample showing how to iterate through and change each string. – wariacik May 28 '15 at 19:41
  • @Anthon Although I prefer to use C#, I could use the Python solution as an alternative. If there is no answer written in C#, I will accept your solution. – wariacik May 28 '15 at 19:41
  • @kspearrin Thanks a lot for the suggestion, but just as with Dreamweaver's suggestion I could not find any sample showing how to iterate over a YAML file and change each string. – wariacik May 28 '15 at 19:43

2 Answers

2

The YAML specification has this to say:

In the representation model, mapping keys do not have an order. To serialize a mapping, it is necessary to impose an ordering on its keys. This order is a serialization detail and should not be used when composing the representation graph (and hence for the preservation of application data). In every case where node order is significant, a sequence must be used. For example, an ordered mapping can be represented as a sequence of mappings, where each mapping is a single key: value pair. YAML provides convenient compact notation for this case.

So you really shouldn’t expect YAML to maintain any order when loading and saving documents.

That being said, I totally understand where you are coming from. Since YAML documents are meant for humans, maintaining a certain order is definitely helpful. Unfortunately, because of the specification, most implementations will use unordered data structures to represent the key/value mappings. In C# and Python, this would be a dictionary; and dictionaries are by design without order.

But both C# and Python do have ordered dictionary types, OrderedDictionary and OrderedDict, and at least for Python, there has been some effort in the past to maintain the key order using ordered dictionaries.

That’s the Python side; I’m sure there are similar efforts for C# implementations too.
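
The spec's suggestion quoted above, representing an ordered mapping as a sequence of single-pair mappings, can be sketched in plain Python. This is only an illustration using the standard library; the helper names `to_pair_sequence` and `from_pair_sequence` are made up for this example and are not part of any YAML library.

```python
from collections import OrderedDict


def to_pair_sequence(pairs):
    """Turn ordered (key, value) pairs into a list of single-key mappings.

    A list corresponds to a YAML sequence, whose order is significant,
    unlike the key order of a plain mapping.
    """
    return [{key: value} for key, value in pairs]


def from_pair_sequence(sequence):
    """Rebuild an ordered mapping from a list of single-key mappings."""
    ordered = OrderedDict()
    for item in sequence:
        if len(item) != 1:
            raise ValueError("expected exactly one key per mapping: %r" % (item,))
        # Unpack the only (key, value) pair of this mapping.
        (key, value), = item.items()
        ordered[key] = value
    return ordered


pairs = [("receipt", "Oz-Ware Purchase Invoice"),
         ("date", "2007-08-06"),
         ("customer", "Dorothy")]
sequence = to_pair_sequence(pairs)
restored = from_pair_sequence(sequence)
```

The trade-off is that the document itself changes shape: consumers have to read a list of one-entry mappings instead of a single mapping, so this only helps when you control the file format.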

poke
1

Most YAML parsers are built for reading YAML, whether written by other programs or edited by humans, and for writing YAML to be read by other programs. What is notoriously lacking is the ability of parsers to write YAML that is still readable by humans:

  • the order of mapping keys is undefined
  • comments get thrown away
  • the scalar literal block style, if any, is dropped
  • spacing around scalars is discarded
  • the scalar folding information, if any, is dropped

Loading a dump of a loaded handcrafted YAML file will result in the same internal data structures as the initial load, but the intermediate dump doesn't normally look like the original (handcrafted) YAML.

If you have a Python program:

import ruamel.yaml as yaml

yaml_str = """\
receipt:     Oz-Ware Purchase Invoice
date:        2007-08-06
customer:
    given:   Dorothy

items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)

    - part_no:   E1628
      descrip:   High Heeled "Ruby" Slippers
      size:      8

bill-to:  &id001
    street: |
            123 Tornado Alley
            Suite 16
    city:   East Centerville
    state:  KS

ship-to:  *id001

specialDelivery:  >
    Follow the Yellow Brick
    Road to the Emerald City.
"""

data1 = yaml.load(yaml_str, Loader=yaml.Loader)
dump_str = yaml.dump(data1, Dumper=yaml.Dumper)
data2 = yaml.load(dump_str, Loader=yaml.Loader)

Then the following assertions hold:

assert data1 == data2
assert dump_str != yaml_str

The intermediate dump_str looks like:

bill-to: &id001 {city: East Centerville, state: KS, street: '123 Tornado Alley

    Suite 16

    '}
customer: {given: Dorothy}
date: 2007-08-06
items:
- {descrip: Water Bucket (Filled), part_no: A4786}
- {descrip: High Heeled "Ruby" Slippers, part_no: E1628, size: 8}
receipt: Oz-Ware Purchase Invoice
ship-to: *id001
specialDelivery: 'Follow the Yellow Brick Road to the Emerald City.

  '

The above is the default behaviour for ruamel.yaml, PyYAML, and many YAML parsers in other languages, as well as online YAML conversion services. For some parsers this is the only behaviour provided.

The reason I started ruamel.yaml as an enhancement of PyYAML was to make going from handcrafted YAML to internal data and back to YAML (what I call round-tripping) result in something that is more human-readable and preserves more information (especially comments).

data = yaml.load(yaml_str, Loader=yaml.RoundTripLoader)
print(yaml.dump(data, Dumper=yaml.RoundTripDumper))

gives you:

receipt: Oz-Ware Purchase Invoice
date: 2007-08-06
customer:
  given: Dorothy
items:
- part_no: A4786
  descrip: Water Bucket (Filled)
- part_no: E1628
  descrip: High Heeled "Ruby" Slippers
  size: 8
bill-to: &id001
  street: |
    123 Tornado Alley
    Suite 16
  city: East Centerville
  state: KS
ship-to: *id001
specialDelivery: 'Follow the Yellow Brick Road to the Emerald City.

  '

My focus has been on comments, key order, and literal block style. Spacing around scalars and folded scalars are not (yet) handled specially.


Starting from there (you could also do this in PyYAML, but you would not have ruamel.yaml's built-in key-order preservation), you can either provide special emitters or hook into the system at a lower level, overriding some methods in emitter.py (and making sure you can call the originals for the cases you don't need to handle):

def rewrite_write_plain(self, text, split=True):
    if self.state == self.expect_block_mapping_simple_value:
        text = '###' + text + '###'
        while self.column < 20:
            text = ' ' + text
            self.column += 1
    self._org_write_plain(text, split)

def rewrite_write_literal(self, text):
    if self.state == self.expect_block_mapping_simple_value:
        last_nl = False
        if text and text[-1] == '\n':
            last_nl = True
            text = text[:-1]
        text = '###' + text + '###'
        if False:
            extra_indent = ''
            while self.column < 15:
                text = ' ' + text
                extra_indent += ' '
                self.column += 1
            text = text.replace('\n', '\n' + extra_indent)
        if last_nl:
            text += '\n'
    self._org_write_literal(text)

def rewrite_write_single_quoted(self, text, split=True):
    if self.state == self.expect_block_mapping_simple_value:
        last_nl = False
        if text and text[-1] == u'\n':
            last_nl = True
            text = text[:-1]
        text = u'###' + text + u'###'
        if last_nl:
            text += u'\n'
    self.write_folded(text)

def rewrite_write_indicator(self, indicator, need_whitespace,
                    whitespace=False, indention=False):
    if indicator and indicator[0] in u"*&":
        indicator = u'###' + indicator + u'###'
        while self.column < 20:
            indicator = ' ' + indicator
            self.column += 1
    self._org_write_indicator(indicator, need_whitespace, whitespace,
                              indention)

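# Assumption: `dumper` is not defined in the snippet as posted; the
# round-trip dumper class is the most plausible target for the patching.
dumper = yaml.RoundTripDumper
# Keep a reference to each original method, then install the override.
dumper._org_write_plain = dumper.write_plain
dumper.write_plain = rewrite_write_plain
dumper._org_write_literal = dumper.write_literal
dumper.write_literal = rewrite_write_literal
dumper._org_write_single_quoted = dumper.write_single_quoted
dumper.write_single_quoted = rewrite_write_single_quoted
dumper._org_write_indicator = dumper.write_indicator
dumper.write_indicator = rewrite_write_indicator

print(yaml.dump(data, Dumper=dumper, indent=4))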

gives you:

receipt:             ###Oz-Ware Purchase Invoice###
date:                ###2007-08-06###
customer:
    given:           ###Dorothy###
items:
-   part_no:         ###A4786###
    descrip:         ###Water Bucket (Filled)###
-   part_no:         ###E1628###
    descrip:         ###High Heeled "Ruby" Slippers###
    size:            ###8###
bill-to:             ###&id001###
    street: |
        ###123 Tornado Alley
        Suite 16###
    city:            ###East Centerville###
    state:           ###KS###
ship-to:             ###*id001###
specialDelivery: >
    ###Follow the Yellow Brick Road to the Emerald City.###

which is hopefully acceptable for further processing in C#.
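
If the exact layout matters less than the values, a simpler route is to transform the loaded data before dumping it, rather than patching the emitter. The following is a minimal sketch under assumptions: it expects the loaded document to consist of plain dicts, lists, and scalars, and `wrap_scalars` is a hypothetical helper written for this example, not something ruamel.yaml or PyYAML provides. It does not wrap anchor or alias names, and because it returns copies, a node shared through an alias ends up as two separate objects.

```python
def wrap_scalars(node, prefix="###", suffix="###"):
    """Return a copy of `node` with every scalar value wrapped as a string.

    Mapping keys are left alone; only values are rewritten, as in the
    desired output of the question.
    """
    if isinstance(node, dict):
        return dict((key, wrap_scalars(value, prefix, suffix))
                    for key, value in node.items())
    if isinstance(node, list):
        return [wrap_scalars(item, prefix, suffix) for item in node]
    # Everything else (str, int, date, ...) is treated as a scalar.
    text = str(node)
    if text.endswith("\n"):
        # Keep a trailing newline (block scalars) outside the suffix.
        return prefix + text[:-1] + suffix + "\n"
    return prefix + text + suffix


data = {
    "receipt": "Oz-Ware Purchase Invoice",
    "items": [{"part_no": "A4786", "size": 8}],
    "street": "123 Tornado Alley\nSuite 16\n",
}
wrapped = wrap_scalars(data)
```

Note that numbers and dates become strings here, so a dumper may quote them; whether that is acceptable depends on what consumes the file afterwards.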

Anthon
  • Thanks for the solution, however with some more complicated files I am getting: ruamel.yaml.scanner.ScannerError: while scanning a double-quoted scalar in "", line 116, column 17: hint: "For a more detailed explanation ... ^ found unknown escape character '&' in "", line 116, column 72: ... planation,\n <a data-toggle=\"modal\" data-target=\ ... – wariacik May 30 '15 at 09:32
  • the second error I am getting is: ruamel.yaml.scanner.ScannerError: mapping values are not allowed here in "", line 1, column 7: --- en: – wariacik May 30 '15 at 09:34
  • @wariacik 1) Can I access the exact file that throws the error somewhere? 2) Which version of Python is this on (`python --version`)? That might be a problem with the Python3 series. – Anthon May 30 '15 at 10:24
  • Thanks a lot for response. Here you go - simplified yaml https://gist.github.com/anonymous/758c5fedd339ee061b59 version Python 2.6 (r26:66714 – wariacik May 30 '15 at 14:24
  • @wariacik Normally, if you get anything after '---' at all, it has to be a directive indicator (i.e. what type the following data is). 'en:' is a mapping key and should be on a new line, unindented. What program generated that file? And what about the more complicated file you mentioned in your first comment? – Anthon May 30 '15 at 14:52
  • @wariacik with the example in the first comment I expect that the special character `&` was not in a quoted string, where it should have been. But that is only a guess from the error message and the small context. – Anthon May 30 '15 at 15:29
  • if you comment #---en, and process file, then you will get the second error that I have mentioned(Unescaped char - line 9). – wariacik May 30 '15 at 15:42
  • The error is in column 72 that is `\` before "&quot" and the reason being that in double quoted strings you can escape certain characters (such as newline with `\n`) but `\q` is not a defined escape sequence. Here are the allowed escaped chars defined http://yaml.org/spec/1.2/spec.html#id2776092 – Anthon May 30 '15 at 17:11
  • So that backslash should not be there, or it should be doubled. What program did generate this file? – Anthon May 30 '15 at 17:12
  • I've sent you an email at anthon@mnt.org with those complicated files. Even after I escaped the file, the result of parsing is not complete. If you have a little time to look at them, that would be great. Any help is appreciated. – wariacik May 30 '15 at 21:49