150

I'd like to get PyYAML's loader to load mappings (and ordered mappings) into the Python 2.7+ OrderedDict type, instead of the vanilla dict and the list of pairs it currently uses.

What's the best way to do that?

dreftymac
Eric Naeseth

8 Answers

187

Python >= 3.6

In Python 3.6+, dicts keep their insertion order, so PyYAML appears to load mappings in document order by default without any special dictionary type. The default Dumper, on the other hand, sorts dictionaries by key. Starting with PyYAML 5.1, you can turn this off by passing sort_keys=False:

import yaml

a = dict(zip("unsorted", "unsorted"))
s = yaml.safe_dump(a, sort_keys=False)  # requires PyYAML >= 5.1
b = yaml.safe_load(s)

assert list(a.keys()) == list(b.keys())  # True

This works because of the new dict implementation, which had already been in use in PyPy for some time. While still considered an implementation detail in CPython 3.6, "the insertion-order preserving nature of dicts has been declared an official part of the Python language spec" as of 3.7; see What's New In Python 3.7.

Note that this is still undocumented on PyYAML's side, so you shouldn't rely on it for safety-critical applications.
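
To make the contrast between the default key sorting and sort_keys=False concrete, here is a minimal sketch. It assumes PyYAML 5.1 or newer on Python 3.7+; the key names are made up for illustration.

```python
import yaml

# Keys are deliberately inserted in non-alphabetical order.
data = {"zebra": 1, "apple": 2, "mango": 3}

# Default behaviour: the Dumper sorts mapping keys.
sorted_text = yaml.safe_dump(data)

# With sort_keys=False the insertion order of the dict is written out.
ordered_text = yaml.safe_dump(data, sort_keys=False)

# Loading gives back plain dicts whose key order follows the text.
keys_after_default_dump = list(yaml.safe_load(sorted_text))
keys_after_ordered_dump = list(yaml.safe_load(ordered_text))
```

Only the second round trip gives back the original key order; the first one comes back alphabetically sorted.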

Original answer (compatible with all known versions)

I like @James' solution for its simplicity. However, it changes the default global yaml.Loader class, which can lead to troublesome side effects. That is a bad idea, especially when writing library code. Also, it doesn't directly work with yaml.safe_load().

Fortunately, the solution can be improved without much effort:

import yaml
from collections import OrderedDict

def ordered_load(stream, Loader=yaml.SafeLoader, object_pairs_hook=OrderedDict):
    class OrderedLoader(Loader):
        pass
    def construct_mapping(loader, node):
        loader.flatten_mapping(node)
        return object_pairs_hook(loader.construct_pairs(node))
    OrderedLoader.add_constructor(
        yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
        construct_mapping)
    return yaml.load(stream, OrderedLoader)

# usage example:
ordered_load(stream, yaml.SafeLoader)

For serialization, you could use the following function:

def ordered_dump(data, stream=None, Dumper=yaml.SafeDumper, **kwds):
    class OrderedDumper(Dumper):
        pass
    def _dict_representer(dumper, data):
        return dumper.represent_mapping(
            yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
            data.items())
    OrderedDumper.add_representer(OrderedDict, _dict_representer)
    return yaml.dump(data, stream, OrderedDumper, **kwds)

# usage:
ordered_dump(data, Dumper=yaml.SafeDumper)

In each case, you could also make the custom subclasses global, so that they don't have to be recreated on each call.
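
As a rough sketch of that last suggestion, the subclasses can be defined once at module level. The helper names below are illustrative rather than part of PyYAML, and the example assumes the SafeLoader/SafeDumper base classes:

```python
import yaml
from collections import OrderedDict

_MAPPING_TAG = yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG


class OrderedLoader(yaml.SafeLoader):
    """SafeLoader subclass that builds an OrderedDict for every mapping."""


class OrderedDumper(yaml.SafeDumper):
    """SafeDumper subclass that writes OrderedDict as a plain mapping."""


def _construct_ordered_mapping(loader, node):
    loader.flatten_mapping(node)  # expand "<<" merge keys first
    return OrderedDict(loader.construct_pairs(node))


def _represent_ordered_mapping(dumper, data):
    # Passing the items view instead of the dict avoids PyYAML's key sorting.
    return dumper.represent_mapping(_MAPPING_TAG, data.items())


# Registered once at import time; yaml.SafeLoader/SafeDumper stay untouched
# because add_constructor/add_representer copy the registry per subclass.
OrderedLoader.add_constructor(_MAPPING_TAG, _construct_ordered_mapping)
OrderedDumper.add_representer(OrderedDict, _represent_ordered_mapping)

text = "b: 1\na:\n  d: 2\n  c: 3\n"
data = yaml.load(text, Loader=OrderedLoader)
round_tripped = yaml.dump(data, Dumper=OrderedDumper, default_flow_style=False)
```

Callers then pass Loader=OrderedLoader or Dumper=OrderedDumper directly, and no class is created per call.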

coldfix
  • 4
    +1 - thank you very much for this, it's saved me so much trouble. – Nobilis Mar 25 '14 at 11:05
  • 2
    This implementation breaks YAML merge tags, BTW – Randy Jul 29 '14 at 16:00
  • 1
    @Randy Thanks. I didn't run in that scenario before, but now I added a fix to handle this as well (I hope). – coldfix Jul 29 '14 at 19:59
  • This would have saved me horrible hacks in a yaml based charactersheet I wrote a handful of years ago. Maybe it’s time to revisit that. I hope something like this goes upstream eventually! – Arne Babenhauserheide Apr 16 '15 at 13:07
  • 9
    @ArneBabenhauserheide I am not sure if PyPI is upstream enough, but take a look at [ruamel.yaml](https://pypi.python.org/pypi/ruamel.yaml) (I am the author of that) if you think it does. – Anthon Jun 10 '15 at 18:05
  • 1
    @Anthon Your ruamel.yaml library works very well. Thanks for that. – Jan Vlcinsky Nov 21 '15 at 23:42
  • 1
    @coldfix, the ordered_dump() isn't working for me. The simple items are coming out properly, but the nested dictionaries are not. For example: swagger: '2.0' info: description: My API version: v1 title: My API contact: {name: Company, url: 'https://api.company.com', email: company@company.com} Any ideas why this might be? Thanks. – Martin Del Vecchio Jan 30 '18 at 16:24
  • @MartinDelVecchio what doesn't work exactly? If you don't like the formatting, try passing `default_flow_style=False` as keyword argument. – coldfix Jan 30 '18 at 16:50
  • I figured that out, but StackOverflow wouldn't let me edit my comment too many times. Without default_flow_style=False, the YAML syntax was incorrect. With it, it is correct. Thanks! – Martin Del Vecchio Jan 30 '18 at 20:46
  • @MartinDelVecchio It's still correct YAML syntax without, just less pretty. – coldfix Jan 31 '18 at 13:38
  • To achieve yaml.safe_load, just make loader inherit from SafeLoader `def ordered_load(stream, Loader=yaml.SafeLoader, object_pairs_hook=OrderedDict):` In PyYAML 4.1 and newer, the `yaml.load()` API will act like `yaml.safe_load()` – PT Huynh Oct 19 '18 at 06:44
  • When used with a file that contains `jinja` templates, this results in `unhashable type: collections.OrderedDict`. I presume that the custom loader generates and `OrderedDict`, which it then attempts to process again, but can't, because it's not hashable. – orodbhen Feb 28 '19 at 14:29
  • Perhaps something has changed in the `yaml` upstream code, but the ordered loader no longer works. The loaded data is definitely being sorted. – orodbhen Jun 11 '19 at 13:45
60

2018 option:

oyaml is a drop-in replacement for PyYAML which preserves dict ordering. Both Python 2 and Python 3 are supported. Just pip install oyaml, and import as shown below:

import oyaml as yaml

You'll no longer be annoyed by screwed-up mappings when dumping/loading.

Note: I'm the author of oyaml.

wim
57

The yaml module allows you to specify custom 'representers' to convert Python objects to text and 'constructors' to reverse the process.

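import collections
import yaml

_mapping_tag = yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG

def dict_representer(dumper, data):
    # items() works on Python 2 and 3; iteritems() exists only on Python 2
    return dumper.represent_dict(data.items())

def dict_constructor(loader, node):
    # expand "<<" merge keys before the pairs are read
    loader.flatten_mapping(node)
    return collections.OrderedDict(loader.construct_pairs(node))

# note: both calls modify PyYAML's default Dumper and Loader classes globally
yaml.add_representer(collections.OrderedDict, dict_representer)
yaml.add_constructor(_mapping_tag, dict_constructor)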
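
A mapping constructor has to call loader.flatten_mapping(node) before construct_pairs in order to load YAML merge keys (<<: *anchor). The sketch below illustrates this with a private SafeLoader subclass instead of the global loader; the class name and the sample document are made up for the example.

```python
import yaml
from collections import OrderedDict


class MergeAwareOrderedLoader(yaml.SafeLoader):
    """Hypothetical private subclass, so yaml.SafeLoader itself is not modified."""


def construct_ordered_mapping(loader, node):
    # Replace any "<<" entries by the key/value pairs of the merged mappings.
    loader.flatten_mapping(node)
    return OrderedDict(loader.construct_pairs(node))


MergeAwareOrderedLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
    construct_ordered_mapping)

document = """\
defaults: &defaults
  host: localhost
  port: 5432
production:
  <<: *defaults
  host: db.example.com
"""

config = yaml.load(document, Loader=MergeAwareOrderedLoader)
production = config["production"]
```

The merged pairs are placed ahead of the mapping's own pairs, so keys written explicitly in the mapping override the merged ones while the merged keys keep their position.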
Brice M. Dempsey
  • 6
    any explanations for this answer? – Shuman Mar 01 '16 at 07:18
  • 1
    Or even better `from six import iteritems` and then change it to `iteritems(data)` so that it works equally well in Python 2 & 3. – Midnighter Apr 28 '17 at 14:50
  • 5
    This seems to be using undocumented features of PyYAML (`represent_dict` and `DEFAULT_MAPPING_TAG`). Is this because the documentation is incomplete, or are these features unsupported and subject to change without notice? – aldel Sep 13 '17 at 00:05
  • 3
    Note that for `dict_constructor` you'll need to call `loader.flatten_mapping(node)` or you won't be able to load `<<: *...` (merge syntax) – anthony sottile Apr 11 '18 at 15:58
  • @brice-m-dempsey can you add any example how to use your code? It does not seem to work in my case (Python 3.7) – schaffe May 07 '19 at 00:37
  • –1 This breaks `yaml` module. See https://yaml.org/type/merge.html for an example of valid markup which subsequently fails to load. – wim Jan 12 '20 at 15:59
29

2015 (and later) option:

ruamel.yaml is a drop-in replacement for PyYAML (disclaimer: I am the author of that package). Preserving the order of the mappings was one of the things added in the first version (0.1) back in 2015. Not only does it preserve the order of your dictionaries, it also preserves comments, anchor names, and tags, and it supports the YAML 1.2 specification (released 2009).

The specification says that the ordering is not guaranteed, but of course there is ordering in the YAML file and the appropriate parser can just hold on to that and transparently generate an object that keeps the ordering. You just need to choose the right parser, loader and dumper¹:

import sys
from ruamel.yaml import YAML

yaml_str = """\
3: abc
conf:
    10: def
    3: gij     # h is missing
more:
- what
- else
"""

yaml = YAML()
data = yaml.load(yaml_str)
data['conf'][10] = 'klm'
data['conf'][3] = 'jig'
yaml.dump(data, sys.stdout)

will give you:

3: abc
conf:
  10: klm
  3: jig       # h is missing
more:
- what
- else

data is of type CommentedMap which functions like a dict, but has extra information that is kept around until being dumped (including the preserved comment!)

Anthon
  • That's pretty nice if you already have a YAML file, but how do you do that using a Python structure? I tried using `CommentedMap` directly but it does not work, and `OrderedDict` puts `!!omap` everywhere which is not very user-friendly. – Holt Jul 14 '20 at 09:56
  • I am not sure why CommentedMap did not work for you. Can you post a question with your (minimalized) code and tag it ruamel.yaml? That way I will be notified and answer. – Anthon Jul 15 '20 at 08:33
  • Sorry, I think it's because I tried to save the `CommentedMap` with `safe=True` in `YAML`, which did not work (using `safe=False` works). I also had issue with `CommentedMap` not being modifiable, but I cannot reproduce it now... I'll open a new question if I encounter this issue again. – Holt Jul 15 '20 at 08:35
  • You should be using `yaml = YAML()`, you get the round-trip parser/dumper and that is derivative of the safe parser/dumper that knows about CommentedMap/Seq etc. – Anthon Jul 15 '20 at 08:38
  • In fact it is possible to preserve key order (but obviously not comments) in safe mode too! Say if I need to dump a plain dict to .yaml and keep the key order then yaml = YAML(typ='safe', pure=True); yaml.sort_base_mapping_type_on_output = False; will do the trick. However, setting of sort_base_mapping_type_on_output should be done immediately after yaml creation or at least before any dumping, otherwise it is not propagated to the representer. Still you can always do yaml.representer.sort_base_mapping_type_on_output = False. – serge.v Nov 12 '21 at 14:15
  • 1
    @serge.v That is a side effect of you using a more modern version of Python than was current when this answer was given. The underlying `dict()` in Python preserves order nowadays, but it didn't use to. – Anthon Nov 12 '21 at 15:24
15

Note: there is a library based on the following answer, which also implements the CLoader and CDumper: Phynix/yamlloader

I doubt very much that this is the best way to do it, but this is the way I came up with, and it does work. Also available as a gist.

import yaml
import yaml.constructor

try:
    # included in standard lib from Python 2.7
    from collections import OrderedDict
except ImportError:
    # try importing the backported drop-in replacement
    # it's available on PyPI
    from ordereddict import OrderedDict

class OrderedDictYAMLLoader(yaml.Loader):
    """
    A YAML loader that loads mappings into ordered dictionaries.
    """

    def __init__(self, *args, **kwargs):
        yaml.Loader.__init__(self, *args, **kwargs)

        self.add_constructor(u'tag:yaml.org,2002:map', type(self).construct_yaml_map)
        self.add_constructor(u'tag:yaml.org,2002:omap', type(self).construct_yaml_map)

    def construct_yaml_map(self, node):
        data = OrderedDict()
        yield data
        value = self.construct_mapping(node)
        data.update(value)

    def construct_mapping(self, node, deep=False):
        if isinstance(node, yaml.MappingNode):
            self.flatten_mapping(node)
        else:
            raise yaml.constructor.ConstructorError(None, None,
                'expected a mapping node, but found %s' % node.id, node.start_mark)

        mapping = OrderedDict()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            try:
                hash(key)
            except TypeError as exc:
                raise yaml.constructor.ConstructorError('while constructing a mapping',
                    node.start_mark, 'found unacceptable key (%s)' % exc, key_node.start_mark)
            value = self.construct_object(value_node, deep=deep)
            mapping[key] = value
        return mapping
Mayou36
Eric Naeseth
  • If you want to include the `key_node.start_mark` attribute in your error message, I don't see any obvious way to simplify your central construction loop. If you try to make use of the fact that the `OrderedDict` constructor will accept an iterable of key, value pairs, you lose access to that detail when generating the error message. – ncoghlan Feb 26 '11 at 15:52
  • has anyone tested this code properly? I can not get it to work in my application! – theAlse Jun 04 '13 at 07:03
  • Example Usage: ordered_dict = yaml.load( ''' b: 1 a: 2 ''', Loader=OrderedDictYAMLLoader) # ordered_dict = OrderedDict([('b', 1), ('a', 2)]) Unfortunately my edit to the post was rejected, so please excuse lack of formatting. – Colonel Panic Oct 16 '13 at 22:06
  • This implementation breaks loading of [ordered mapping types](http://yaml.org/type/omap.html). To fix this, you can just remove the second call to `add_constructor` in your `__init__` method. – Ryan Feb 02 '17 at 23:19
11

Update: the library was deprecated in favor of yamlloader (which is based on yamlordereddictloader).

I've just found a Python library (https://pypi.python.org/pypi/yamlordereddictloader/0.1.1) which was created based on answers to this question and is quite simple to use:

import yaml
import yamlordereddictloader

datas = yaml.load(open('myfile.yml'), Loader=yamlordereddictloader.Loader)
Mayou36
Alex Chekunkov
3

For my PyYAML installation on Python 2.7, I updated __init__.py, constructor.py, and loader.py so that the load commands now support an object_pairs_hook option. The diff of my changes is below.

__init__.py

$ diff __init__.py Original
64c64
< def load(stream, Loader=Loader, **kwds):
---
> def load(stream, Loader=Loader):
69c69
<     loader = Loader(stream, **kwds)
---
>     loader = Loader(stream)
75c75
< def load_all(stream, Loader=Loader, **kwds):
---
> def load_all(stream, Loader=Loader):
80c80
<     loader = Loader(stream, **kwds)
---
>     loader = Loader(stream)

constructor.py

$ diff constructor.py Original
20,21c20
<     def __init__(self, object_pairs_hook=dict):
<         self.object_pairs_hook = object_pairs_hook
---
>     def __init__(self):
27,29d25
<     def create_object_hook(self):
<         return self.object_pairs_hook()
<
54,55c50,51
<         self.constructed_objects = self.create_object_hook()
<         self.recursive_objects = self.create_object_hook()
---
>         self.constructed_objects = {}
>         self.recursive_objects = {}
129c125
<         mapping = self.create_object_hook()
---
>         mapping = {}
400c396
<         data = self.create_object_hook()
---
>         data = {}
595c591
<             dictitems = self.create_object_hook()
---
>             dictitems = {}
602c598
<             dictitems = value.get('dictitems', self.create_object_hook())
---
>             dictitems = value.get('dictitems', {})

loader.py

$ diff loader.py Original
13c13
<     def __init__(self, stream, **constructKwds):
---
>     def __init__(self, stream):
18c18
<         BaseConstructor.__init__(self, **constructKwds)
---
>         BaseConstructor.__init__(self)
23c23
<     def __init__(self, stream, **constructKwds):
---
>     def __init__(self, stream):
28c28
<         SafeConstructor.__init__(self, **constructKwds)
---
>         SafeConstructor.__init__(self)
33c33
<     def __init__(self, stream, **constructKwds):
---
>     def __init__(self, stream):
38c38
<         Constructor.__init__(self, **constructKwds)
---
>         Constructor.__init__(self)
EricGreg
-1

Here's a simple solution that also checks for duplicated top-level keys in your map.

import yaml
import re
from collections import OrderedDict

def yaml_load_od(fname):
    "load a yaml file as an OrderedDict"
    # detects any duped keys (fail on this) and preserves order of top level keys
    with open(fname, 'r') as f:
        lines = f.read().splitlines()
        top_keys = []
        duped_keys = []
        for line in lines:
            m = re.search(r'^([A-Za-z0-9_]+) *:', line)
            if m:
                if m.group(1) in top_keys:
                    duped_keys.append(m.group(1))
                else:
                    top_keys.append(m.group(1))
        if duped_keys:
            raise Exception('ERROR: duplicate keys: {}'.format(duped_keys))
    # 2nd pass to set up the OrderedDict
    with open(fname, 'r') as f:
        d_tmp = yaml.safe_load(f)  # yaml.load() without a Loader fails on PyYAML >= 6
    return OrderedDict([(key, d_tmp[key]) for key in top_keys])
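
The regex pass above only sees top-level keys. As an alternative sketch (not the approach used above), duplicate keys can be rejected at every nesting level by overriding construct_mapping in a SafeLoader subclass. UniqueKeyLoader is a hypothetical name, and the example assumes Python 3.7+, where the plain dicts built by the loader keep document order.

```python
import yaml
from collections.abc import Hashable
from yaml.constructor import ConstructorError

MERGE_TAG = "tag:yaml.org,2002:merge"


class UniqueKeyLoader(yaml.SafeLoader):
    """Hypothetical loader that raises on duplicate keys in any mapping."""

    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _ in node.value:
            if key_node.tag == MERGE_TAG:
                continue  # "<<" entries are expanded later by the base class
            key = self.construct_object(key_node, deep=True)
            if not isinstance(key, Hashable):
                continue  # the base class reports unhashable keys itself
            if key in seen:
                raise ConstructorError(
                    "while constructing a mapping", node.start_mark,
                    "found duplicate key (%r)" % (key,), key_node.start_mark)
            seen.add(key)
        return super().construct_mapping(node, deep=deep)


def load_unique(text):
    """Load YAML text, failing on duplicate keys at any depth."""
    return yaml.load(text, Loader=UniqueKeyLoader)


data = load_unique("b: 1\na: 2\n")
```

Keys that arrive through a merge are not counted as duplicates here, since overriding merged keys is the intended use of that feature.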