
I made a little test case to compare YAML and JSON speed:

import json
import yaml
from datetime import datetime

NB_ROW = 1024

print('Is yaml using libyaml?', 'yes' if yaml.__with_libyaml__ else 'no')

dummy_data = [{'dummy_key_A_%s' % i: i, 'dummy_key_B_%s' % i: i} for i in range(NB_ROW)]


with open('perf_json_yaml.yaml', 'w') as fh:
    t1 = datetime.now()
    yaml.safe_dump(dummy_data, fh, default_flow_style=False)
    t2 = datetime.now()
    dty = (t2 - t1).total_seconds()
    print('Dumping %s rows into a yaml file: %s' % (NB_ROW, dty))

with open('perf_json_yaml.json', 'w') as fh:
    t1 = datetime.now()
    json.dump(dummy_data, fh)
    t2 = datetime.now()
    dtj = (t2 - t1).total_seconds()
    print('Dumping %s rows into a json file: %s' % (NB_ROW, dtj))

print('json is %dx faster for dumping' % (dty / dtj))

with open('perf_json_yaml.yaml') as fh:
    t1 = datetime.now()
    data = yaml.safe_load(fh)
    t2 = datetime.now()
    dty = (t2 - t1).total_seconds()
    print('Loading %s rows from a yaml file: %s' % (NB_ROW, dty))

with open('perf_json_yaml.json') as fh:
    t1 = datetime.now()
    data = json.load(fh)
    t2 = datetime.now()
    dtj = (t2 - t1).total_seconds()
    print('Loading %s rows from a json file: %s' % (NB_ROW, dtj))

print('json is %dx faster for loading' % (dty / dtj))

And the result is:

Is yaml using libyaml? yes
Dumping 1024 rows into a yaml file: 0.251139
Dumping 1024 rows into a json file: 0.007725
json is 32x faster for dumping
Loading 1024 rows from a yaml file: 0.401224
Loading 1024 rows from a json file: 0.001793
json is 223x faster for loading

I am using PyYAML 3.11 with the libyaml C library on Ubuntu 12.04. I know that JSON is much simpler than YAML, but with a 223x ratio between JSON and YAML I am wondering whether my configuration is correct or not.

Do you get the same speed ratio?
How can I speed up yaml.load()?

– Eric (edited by oz123)

5 Answers


You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

What's happening is that Python's json library encodes Python's built-in datatypes directly into text chunks, replacing ' with " and deleting , here and there (to oversimplify a bit).

On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.

The same kind of thing has to happen in reverse when loading.

The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you're willing to write your own single-purpose sort-of-YAML parser, taking the following comment into consideration:

YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know no object is repeated and only basic types appear, you can use a JSON serialiser; it will still be valid YAML.
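The quoted comment can be demonstrated with a quick sketch (the data here is made up for illustration): a document dumped with `json` can be read straight back with a YAML loader, because JSON's flow-style syntax is also valid YAML for plain data:

```python
import json
import yaml

# Plain data: no repeated object references, only basic types.
data = {"rows": [1, 2, 3], "name": "perf test"}

# json.dumps produces flow-style collections, which a YAML parser
# also accepts, so the JSON text loads directly as YAML.
text = json.dumps(data)
assert yaml.safe_load(text) == data
```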

-- UPDATE

What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml module uses the pure-Python parser; you have to tell it to use PyYAML's C parser.

You can do it this way:

import yaml
from yaml import CLoader as Loader, CDumper as Dumper

with open('perf_json_yaml.yaml', 'w') as fh:
    yaml.dump(dummy_data, fh, Dumper=Dumper, default_flow_style=False)

with open('perf_json_yaml.yaml') as fh:
    data = yaml.load(fh, Loader=Loader)

In order to do so, you need the libyaml C library installed, for instance with apt-get:

$ apt-get install libyaml-dev

And PyYAML built with LibYAML support as well. But that's already the case based on your output.

I can't test it right now because I'm running OS X and brew has some trouble installing libyaml, but if you follow the PyYAML documentation, it is pretty clear that performance will be much better.
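To see the difference on your own machine, a rough timing sketch (not from the answer; the document and repetition counts are arbitrary) that compares the pure-Python loader against the C loader, falling back gracefully when PyYAML was built without libyaml:

```python
import timeit

import yaml

doc = yaml.safe_dump([{"key_%d" % i: i} for i in range(500)])

# CSafeLoader only exists when PyYAML was compiled against libyaml.
c_loader = getattr(yaml, "CSafeLoader", None)

t_py = timeit.timeit(lambda: yaml.load(doc, Loader=yaml.SafeLoader), number=3)
print("pure-Python SafeLoader: %.4fs" % t_py)
if c_loader is not None:
    t_c = timeit.timeit(lambda: yaml.load(doc, Loader=c_loader), number=3)
    print("C SafeLoader:           %.4fs" % t_c)
```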

– Jivan (edited by Shourya Bansal)
  • loading is still 12x slower with yaml. My sample is a list of 600,000 empty dictionaries. YAML doesn't need to do anything extra except slightly cleverer syntax analysis, which should take almost no extra time. – codeshot May 31 '15 at 08:52
  • On Mac: brew install yaml-cpp libyaml – Hans Nelsen Nov 15 '16 at 22:40
  • Jivan, you're a bloody legend. I was going to rewrite some Python code in C++ to speed things up. My 6 MB yaml file took 53 seconds to load using the standard yaml loader, and only 3 seconds with CLoader. – nevelis Dec 27 '16 at 00:46
  • I am not sure why you are saying that the CLoader speedup is only of interest if you are running under Linux; I just tried this under Windows and it works, giving me a huge speedup. – Mike Nakis Jan 23 '17 at 14:16
  • The comment you link to is incorrect. PyYAML doesn't build a graph. There are no connections between the `Node`s that the representer emits, not even in the case of a single object occurring multiple times in a data structure. – Anthon Sep 01 '18 at 08:54
  • If you get `cannot import name 'CLoader' from 'yaml'`, try installing `libyaml-dev` and then reinstalling pyyaml: `pip --no-cache-dir install --verbose --force-reinstall -I pyyaml` https://github.com/yaml/pyyaml/issues/108 – Niels-Ole Nov 06 '18 at 15:26
  • But make sure you're using the safe loaders: `from yaml import CSafeLoader as Loader, CSafeDumper as Dumper` – Ezra Steinmetz Mar 03 '22 at 12:53
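The import failure mentioned in the comments can be handled defensively. A common pattern (a sketch, not from the answer) is to try the C loader and fall back to the pure-Python one:

```python
import yaml

try:
    # Available only when PyYAML was built against libyaml.
    from yaml import CSafeLoader as FastSafeLoader
except ImportError:
    from yaml import SafeLoader as FastSafeLoader

data = yaml.load("a: 1\nb: [2, 3]\n", Loader=FastSafeLoader)
print(data)  # {'a': 1, 'b': [2, 3]}
```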

For reference, I compared a couple of human-readable formats, and indeed Python's yaml reader is by far the slowest. (Note the log scaling in the plot below.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:

[Plot: load time vs. data size for the readers benchmarked below; log scale]


Code to reproduce the plot:

import json
import tomllib

import numpy
import orjson
import pandas
import perfplot
import toml
import tomli
import ujson
import yaml
from yaml import CLoader, Loader


def setup(n):
    numpy.random.seed(0)
    data = numpy.random.rand(n, 3)

    with open("out.yml", "w") as f:
        yaml.dump(data.tolist(), f)

    with open("out.json", "w") as f:
        json.dump(data.tolist(), f, indent=4)

    with open("out.dat", "w") as f:
        numpy.savetxt(f, data)

    with open("out.toml", "w") as f:
        toml.dump({"data": data.tolist()}, f)


def yaml_python(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=Loader)
    return out


def yaml_c(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=CLoader)
    return out


# def zaml_load(arr):
#     with open("out.yml", "r") as f:
#         out = zaml.load(f)
#     return out["data"]


def json_load(arr):
    with open("out.json", "r") as f:
        out = json.load(f)
    return out


def ujson_load(arr):
    with open("out.json", "r") as f:
        out = ujson.load(f)
    return out


def orjson_load(arr):
    with open("out.json", "rb") as f:
        out = orjson.loads(f.read())
    return out


def loadtxt(arr):
    with open("out.dat", "r") as f:
        out = numpy.loadtxt(f)
    return out


def pandas_read(arr):
    out = pandas.read_csv("out.dat", header=None, sep=" ")
    return out.values


def toml_load(arr):
    with open("out.toml", "r") as f:
        out = toml.load(f)
    return out["data"]


def tomli_load(arr):
    with open("out.toml", "rb") as f:
        out = tomli.load(f)
    return out["data"]


def tomllib_load(arr):
    # tomllib requires a binary-mode file object.
    with open("out.toml", "rb") as f:
        out = tomllib.load(f)
    return out["data"]


b = perfplot.bench(
    setup=setup,
    kernels=[
        yaml_python,
        yaml_c,
        json_load,
        loadtxt,
        pandas_read,
        toml_load,
        tomli_load,
        tomllib_load,
        ujson_load,
        orjson_load,
    ],
    n_range=[2**k for k in range(18)],
)

b.save("out.png")
b.show()
– Nico Schlömer

Yes.

Other answers here have said "use CLoader", which is a great tip, but if you're not using any custom classes (!!foo tags in your YAML), you can squeeze out another ~20% or so by using CBaseLoader instead of plain CLoader.

I had a script that went from ~2min37sec to ~2min7sec with this change.

Should be as easy as this:

import yaml

with open(...) as f:
    data = yaml.load(f, Loader=yaml.CBaseLoader)
– Vegard

Not yet mentioned here: there is also a CSafeLoader C class implementation available.

import yaml
from yaml import CSafeLoader

with open(file_path, 'r', encoding="utf-8") as config_file:
    config_data = yaml.load(config_file, Loader=CSafeLoader)

I found it to be fractionally faster than CLoader (and both were around 15 times faster than the pure-Python implementation, at least for small files under 50 kB).

As noted above, CBaseLoader is around 20% faster than the other C classes. This is because:

BaseLoader(stream) does not resolve or support any tags and constructs only basic Python objects: lists, dictionaries and Unicode strings.

Note, however, that CBaseLoader does not recognise booleans in YAML; they come back as strings.
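The tag-resolution difference is easy to demonstrate (a small sketch with made-up keys): BaseLoader leaves every scalar as a string, while SafeLoader resolves the implicit bool and int tags:

```python
import yaml

doc = "enabled: true\ncount: 3\n"

# BaseLoader skips tag resolution entirely: every scalar stays a str.
base = yaml.load(doc, Loader=yaml.BaseLoader)
# SafeLoader resolves 'true' to a bool and '3' to an int.
safe = yaml.load(doc, Loader=yaml.SafeLoader)

print(base)  # {'enabled': 'true', 'count': '3'}
print(safe)  # {'enabled': True, 'count': 3}
```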

– Stuart Cardall

Yes, I also noticed that JSON is way faster, so a reasonable approach would be to convert YAML to JSON first. If you don't mind Ruby, you can get a big speedup and ditch the yaml install altogether:

import commands, json
def load_yaml_file(fn):
    ruby = "puts YAML.load_file('%s').to_json" % fn
    j = commands.getstatusoutput('ruby -ryaml -rjson -e "%s"' % ruby)
    return json.loads(j[1])
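Note that the `commands` module is Python 2 only and was removed in Python 3. A rough modern equivalent using `subprocess` (a sketch assuming a `ruby` interpreter with its yaml/json stdlibs is on PATH) might look like:

```python
import json
import subprocess


def load_yaml_file(fn):
    # Ask Ruby's YAML (the Psych/libyaml wrapper) to re-emit the file
    # as JSON, then parse that JSON on the Python side.
    ruby = "puts YAML.load_file(ARGV[0]).to_json"
    result = subprocess.run(
        ["ruby", "-ryaml", "-rjson", "-e", ruby, fn],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```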

Here is a comparison for 100K records:

load_yaml_file: 0.95 s
yaml.load: 7.53 s

And for 1M records:

load_yaml_file: 11.55 s
yaml.load: 77.08 s

If you insist on using yaml.load anyway, remember to put it in a virtualenv to avoid conflicts with other software.

– personal_cloud
  • I don't mind ruby, but I do mind bogus answers. 1) You're not really using ruby: in your code you are using a [thin layer around the libyaml C library](https://ruby-doc.org/stdlib-2.3.0/libdoc/yaml/rdoc/YAML.html): "The underlying implementation is the libyaml wrapper Psych". 2) You compare that with PyYAML without the libyaml C library; if you had used it, you would see that Python wrapping libyaml is not 7 times slower but only a few percent. 3) The deprecation of the `commands` module was announced in PEP 0361 in 2006, yet you still propose to use it more than **eleven** years later. – Anthon Aug 25 '18 at 12:20