I need to verify in Python that the data in a YAML file are in alphabetical order by some field (see the example below). Suppose I have a file with data in YAML format:

-
  project: presentations/demo1
  description: Some description for demo1 project
  owner: John Doe

-
  project: templates/template_demo
  description: Some template_demo
  owner: Sarah Connor

So I have to be sure that the data in this file are sorted by 'project' name. I already have a solution that collects all project names (from the respective list of dicts), sorts them, and then compares the result with the raw YAML file. But maybe there are better solutions.

Anthon
machin
  • 1
    In order to check whether the projects are sorted in the YAML file, you don't have to sort the project names, you just have to compare each project name with the next one, which is much quicker if you have a lot of entries. – Anthon Nov 25 '16 at 12:13

2 Answers

If you simplify this problem a bit, it becomes

Check if a list is in sorted order

You can refer to nice ways to do it here

l = [4, 2, 3, 7, 8]
# The list does not have to be sorted first; just check that each
# entry is less than or equal to the next one (range replaces the
# Python 2-only xrange).
is_sorted = all(l[i] <= l[i+1] for i in range(len(l) - 1))  # False for this list

In your case it becomes

data = parse_yaml_file()  # placeholder: parse your YAML data into a list of dicts
is_sorted = all(data[i]['project'] <= data[i+1]['project'] for i in range(len(data) - 1))
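
To also report where the order breaks, the same pairwise comparison can be written with zip. This is only a minimal sketch: first_out_of_order is a hypothetical helper name, and the data list stands in for whatever your parser returns (assumed to be a list of dicts).

```python
def first_out_of_order(entries, key="project"):
    """Return the index of the first entry that sorts after its successor, or None."""
    # zip(entries, entries[1:]) yields each entry paired with the next one.
    for idx, (current, following) in enumerate(zip(entries, entries[1:])):
        if current[key] > following[key]:
            return idx
    return None


# Stand-in for the parsed YAML data (a list of dicts).
data = [
    {"project": "presentations/demo1", "owner": "John Doe"},
    {"project": "templates/template_demo", "owner": "Sarah Connor"},
]

print(first_out_of_order(data))        # None: already in order
print(first_out_of_order(data[::-1]))  # 0: the first pair is out of order
```

Returning an index rather than a bare boolean makes it easier to point at the offending entry in an error message.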
algrebe
  • 1
    PyYAML `load()` is not safe and could cause arbitrary code to be executed on a computer if you don't have complete control over the input YAML. So this is bad advice, and unnecessary, as there is `safe_load()`. Apart from that, PyYAML still only supports the old YAML 1.1 and not the YAML 1.2 specification (as published in 2009). – Anthon Nov 25 '16 at 12:48
  • 1
    @Anthon thank you ! i've edited to remove that part and focus more on the ordering. The OP says (s)he has some solutions working, so I assume the parsing is taken care of. – algrebe Nov 25 '16 at 13:00
  • @algrebe, thanks. You are right, in my case I use `safe_load()`, so I think your solution is rather preferable. The main question was about ordering. – machin Nov 25 '16 at 14:04

IMO you should not assume that your project names are always as simple as the ones you have now. If a project name becomes long, the program that dumped the YAML might have wrapped the scalar string value for project over multiple lines. If the name includes characters that are special in YAML, the dumping program will have put single or double quotes around the scalar string. In addition, the - might be on the same line as the key project, and the value for the key project doesn't have to be on the same line as the key:

- project:
    presentations/demo1
  description: Some description for demo1 project

A YAML parser will automatically reconstruct such a scalar correctly, something that is very difficult to get right using anything other than a YAML parser.
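
To illustrate the point, here is a rough sketch of what goes wrong with a line-based approach. The regular expression and the helper naive_project_names are made up for this example; they stand for any scan that takes whatever follows project: on the same line.

```python
import re

# Hypothetical naive scan: capture the rest of the line after "project:".
NAIVE_PROJECT = re.compile(r"^[ \t]*(?:-[ \t]*)?project:[ \t]*(.*)$", re.MULTILINE)


def naive_project_names(text):
    """Collect what follows 'project:' on each matching line."""
    return [match.group(1).strip() for match in NAIVE_PROJECT.finditer(text)]


# Value on the same line as the key: the naive scan finds it.
same_line = "-\n  project: presentations/demo1\n  owner: John Doe\n"
# Same data, but with the value on the line after the key.
next_line = "- project:\n    presentations/demo1\n  owner: John Doe\n"

print(naive_project_names(same_line))  # ['presentations/demo1']
print(naive_project_names(next_line))  # ['']: the name is missed
```

Both inputs describe the same data, yet the second yields an empty name, so any ordering check built on such a scan would be unreliable.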

Fortunately it is easy to check what you want in Python using a YAML parser:

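from ruamel.yaml import YAML

# YAML(typ='safe') replaces the module-level safe_load(), which is
# deprecated in recent ruamel.yaml releases.
yaml = YAML(typ='safe')
with open('input.yaml') as fp:
    data = yaml.load(fp)
for idx, d in enumerate(data[:-1]):
    assert d['project'] < data[idx+1]['project']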

If you can have projects with the same name, you should use <= instead of <. You will have to install ruamel.yaml in your virtualenv (you are surely using one for development) with pip install ruamel.yaml.
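
The difference between < and <= only shows up when names repeat. A small sketch, using plain strings instead of parsed YAML; is_sorted and its allow_duplicates parameter are hypothetical names chosen for this illustration:

```python
import operator


def is_sorted(names, allow_duplicates=True):
    """Check that each name compares correctly against the next one."""
    # <= tolerates equal neighbours, < rejects them.
    compare = operator.le if allow_duplicates else operator.lt
    return all(compare(a, b) for a, b in zip(names, names[1:]))


names = ["presentations/demo1", "presentations/demo1", "templates/template_demo"]

print(is_sorted(names))                          # True
print(is_sorted(names, allow_duplicates=False))  # False
```

Note that Python compares strings by code point, so "Zeta" < "alpha" is true. If alphabetical should mean case-insensitive for your file, comparing name.casefold() values is one option.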

If you don't just want to check the YAML, but also want to generate a correctly ordered file, you should use:

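from ruamel.yaml import YAML

# The default YAML() instance is the round-trip loader/dumper; it
# replaces round_trip_load()/round_trip_dump(), which are deprecated
# in recent ruamel.yaml releases.
yaml = YAML()
with open('input.yaml') as fp:
    data = yaml.load(fp)
ordered = True
for idx, d in enumerate(data[:-1]):
    if d['project'] > data[idx+1]['project']:
        ordered = False
        break  # one out-of-order pair is enough

if not ordered:
    # group entries by project name so repeated names are all kept
    project_data_map = {}
    for d in data:
        project_data_map.setdefault(d['project'], []).append(d)
    out_data = []
    for project_name in sorted(project_data_map):
        out_data.extend(project_data_map[project_name])
    with open('output.yaml', 'w') as fp:
        yaml.dump(out_data, fp)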

This will preserve the order of the keys in the individual mappings/dicts, as well as any comments.

The setdefault().append() handles project names that are repeated in the input as separate entries. So you will have the same number of projects in the output as in the input, even if some of them share a project name.
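
As a rough illustration with plain dicts instead of loaded YAML, the grouping step keeps every entry, repeated names included. Because Python's sort is stable, sorted() with a key function should give the same order, which may be a shorter alternative. The entries below (including the owner "Jane Roe") are invented sample data.

```python
entries = [
    {"project": "templates/template_demo", "owner": "Sarah Connor"},
    {"project": "presentations/demo1", "owner": "John Doe"},
    {"project": "presentations/demo1", "owner": "Jane Roe"},
]

# Grouping as described above: one list of entries per project name.
grouped = {}
for entry in entries:
    grouped.setdefault(entry["project"], []).append(entry)
via_grouping = []
for name in sorted(grouped):
    via_grouping.extend(grouped[name])

# Stable sort on the same key; equal names keep their input order.
via_sorted = sorted(entries, key=lambda entry: entry["project"])

print(len(via_grouping) == len(entries))  # True
print(via_grouping == via_sorted)         # True
```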

Anthon