
I'm working with one script that dumps a pandas Series to a YAML file:

with open('ex.py','w') as f:
    yaml.dump(a_series,f)

And then another script that opens the yaml file for the pandas series:

with open('ex.py','r') as f:
    yaml.safe_load(a_series,f)

I'm trying to safe_load the series but I get a constructor error. How can I specify that the pandas series is safe to load?

Anthon
big11mac

1 Answer


When you use PyYAML's load, you assert that everything in the YAML document you are loading is safe. That is rarely something you can guarantee, which is why you should indeed use yaml.safe_load.

In your case this leads to an error, because safe_load doesn't know how to construct pandas internals that have tags in the YAML document like:

!!python/name:pandas.core.indexes.base.Index

and

!!python/tuple

etc.

You would need to provide constructors for all of these objects, add them to the SafeLoader and then do a_series = yaml.safe_load(f). Doing that can be a lot of work, especially since what looks like a small change to the data used in your series might require you to add more constructors.
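As a minimal sketch of what registering one such constructor looks like, here is the !!python/tuple tag added to a subclass of SafeLoader. The class name TupleSafeLoader and the function name construct_python_tuple are made up for this illustration; the pandas-internal tags would each need a similar (and usually less trivial) constructor.

```python
import yaml


class TupleSafeLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader subclass; the global SafeLoader stays untouched."""


def construct_python_tuple(loader, node):
    # Construct the items with the safe machinery, then convert to a tuple.
    return tuple(loader.construct_sequence(node))


# The short form !!python/tuple expands to this full tag.
TupleSafeLoader.add_constructor(
    'tag:yaml.org,2002:python/tuple', construct_python_tuple)

# The non-safe dump() tags tuples with !!python/tuple.
document = yaml.dump({'shape': (2, 3)})

try:
    yaml.safe_load(document)
    rejected = False
except yaml.constructor.ConstructorError:
    # The stock safe loader has no constructor for the tuple tag.
    rejected = True

loaded = yaml.load(document, Loader=TupleSafeLoader)
print(rejected, loaded)
```

Registering on a subclass rather than on yaml.SafeLoader itself is a design choice of this sketch: yaml.safe_load() keeps its default behaviour and only loads that explicitly pass Loader=TupleSafeLoader accept the extra tag.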

You could dump the dict representation of your Series and load that back. Of course some information is lost in this process, and I am not sure if that is acceptable:

import yaml
from pandas import Series

def series_representer(dumper, data):
    return dumper.represent_mapping(u'!pandas.series', data.to_dict())

yaml.add_representer(Series, series_representer, Dumper=yaml.SafeDumper)

def series_constructor(loader, node):
    d = loader.construct_mapping(node)
    # build the Series from the mapping just loaded, not from the global `data`
    return Series(d)

yaml.add_constructor(u'!pandas.series', series_constructor, Loader=yaml.SafeLoader)

data = Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])

with open('ex.yaml', 'w') as f:
    yaml.safe_dump(data, f)

with open('ex.yaml') as f:
    s = yaml.safe_load(f)

print(s)
print(type(s))

which gives:

a    1
b    2
c    3
d    4
e    5
dtype: int64
<class 'pandas.core.series.Series'>

And the ex.yaml file contains:

!pandas.series {a: 1, b: 2, c: 3, d: 4, e: 5}
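To make the "some information is lost" remark concrete, here is a small sketch of a round trip through to_dict() without any YAML involved. The Series name 'counts' and the int8 dtype are arbitrary choices for the illustration; other attributes may be affected as well.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=['a', 'b', 'c'],
                     dtype='int8', name='counts')

# This is what the representer/constructor pair effectively does.
rebuilt = pd.Series(original.to_dict())

# Values and index labels survive, but the name and the narrow dtype do not.
print(original.name, original.dtype)
print(rebuilt.name, rebuilt.dtype)
```

If you need the name or dtype back, you would have to store them in the mapping yourself and pass them to Series() in the constructor.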

There are a few things to note:

  • YAML documents are normally written to files with a .yaml extension. Using .py is bound to get you confused, or have you overwrite some program source files at some point.

  • yaml.load() and yaml.safe_load() take a stream as their first parameter, so you use them like:

    data = yaml.safe_load(stream)
    

    and not like:

    yaml.safe_load(data, stream)
    
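  • It would be better to have a two-step constructor, which allows you to construct self-referential data structures. However, Series.append() doesn't work for that, because it returns a new Series instead of changing d in place (and it was removed in pandas 2.0), so the following leaves d empty: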

    def series_constructor(loader, node):
        d = Series()
        yield d
        d.append(Series(loader.construct_mapping(node)))
    
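For reference, this is roughly what a working two-step constructor looks like for a type that can be filled in place. Box and BoxLoader are hypothetical names used only for this sketch; it illustrates the generator idiom and is not a solution for Series.

```python
import yaml


class Box:
    """Hypothetical container type used only for this illustration."""

    def __init__(self):
        self.fields = {}


class BoxLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader subclass; the global SafeLoader stays untouched."""


def box_constructor(loader, node):
    box = Box()
    # Step 1: hand the still-empty object to the loader, so that aliases
    # pointing back at this node can already resolve to it.
    yield box
    # Step 2: fill the object in place once the loader resumes the generator.
    box.fields.update(loader.construct_mapping(node))


BoxLoader.add_constructor('!box', box_constructor)

# The value of 'me' is an alias to the enclosing node itself.
document = "&top !box {name: root, me: *top}"
loaded = yaml.load(document, Loader=BoxLoader)
print(loaded.fields['me'] is loaded)
```

The idiom only works because Box can be mutated after it has been handed out, which is exactly what the Series.append() attempt above cannot do.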

If dumping the Series via a dictionary is not good enough (because it simplifies the series' data), and if you don't care about the readability of the generated YAML, you can use .to_pickle() instead of .to_dict(). You would then have to work with temporary files, as at the time of writing that method does not handle file-like objects and expects a file name string as its argument.
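A rough sketch of that pickle route, under several assumptions: the tag name !pandas.series.pickle, the helper names and the PickleDumper/PickleLoader subclasses are all invented for this example, and the pickle bytes are base64-encoded so they fit in a YAML scalar. Newer pandas releases may accept file-like objects directly, which would make the temporary files unnecessary; that is not relied on here. Keep in mind that unpickling can execute arbitrary code, so a loader with this constructor is no longer "safe" for untrusted documents.

```python
import base64
import os
import tempfile

import pandas as pd
import yaml

PICKLE_TAG = '!pandas.series.pickle'  # hypothetical tag name


def series_to_bytes(series):
    # to_pickle() gets a file name, so go through a temporary file.
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)
    try:
        series.to_pickle(path)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)


def series_from_bytes(raw):
    fd, path = tempfile.mkstemp(suffix='.pkl')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw)
        return pd.read_pickle(path)
    finally:
        os.remove(path)


def pickled_series_representer(dumper, data):
    text = base64.encodebytes(series_to_bytes(data)).decode('ascii')
    return dumper.represent_scalar(PICKLE_TAG, text, style='|')


def pickled_series_constructor(loader, node):
    text = loader.construct_scalar(node)
    return series_from_bytes(base64.decodebytes(text.encode('ascii')))


class PickleDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass that knows how to emit a pickled Series."""


class PickleLoader(yaml.SafeLoader):
    """Hypothetical loader subclass; unpickles, so only use on trusted files."""


PickleDumper.add_representer(pd.Series, pickled_series_representer)
PickleLoader.add_constructor(PICKLE_TAG, pickled_series_constructor)

original = pd.Series([1.5, 2.5], name='x',
                     index=pd.to_datetime(['2018-06-11', '2018-06-12']))
document = yaml.dump(original, Dumper=PickleDumper)
restored = yaml.load(document, Loader=PickleLoader)
print(restored.equals(original), restored.name)
```

Unlike the to_dict() variant, this keeps the index type, dtype and name, at the cost of a YAML file that is opaque to human readers.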

Anthon
  • Ok so it is better to just break things down into basic data types before dumping/loading? Thanks for the thorough answer. – big11mac Jun 12 '18 at 12:40
  • @big11mac not necessarily. Using YAML with tags takes a bit of effort, but it is clear from your document that these are expected to be loaded as a special type. If you break down your data and save it as primitives (mapping, sequence, scalar) only, then all of the logic for recreating the appropriate types after loading has to be in your program. If your data structure is predictable, flat (i.e. no arbitrary tree structure) and not too big to fit in memory twice, then not using tags can be an option. – Anthon Jun 12 '18 at 14:38