
I made a little test case to compare YAML and JSON speed:

import json
import yaml
from datetime import datetime

NB_ROW = 1024

print('Is yaml using libyaml?', 'yes' if yaml.__with_libyaml__ else 'no')

dummy_data = [{'dummy_key_A_%s' % i: i, 'dummy_key_B_%s' % i: i} for i in range(NB_ROW)]


with open('perf_json_yaml.yaml', 'w') as fh:
    t1 = datetime.now()
    yaml.safe_dump(dummy_data, fh, default_flow_style=False)
    t2 = datetime.now()
    dty = (t2 - t1).total_seconds()
    print('Dumping %s rows into a yaml file: %s' % (NB_ROW, dty))

with open('perf_json_yaml.json', 'w') as fh:
    t1 = datetime.now()
    json.dump(dummy_data, fh)
    t2 = datetime.now()
    dtj = (t2 - t1).total_seconds()
    print('Dumping %s rows into a json file: %s' % (NB_ROW, dtj))

print('json is %dx faster for dumping' % (dty / dtj))

with open('perf_json_yaml.yaml') as fh:
    t1 = datetime.now()
    data = yaml.safe_load(fh)
    t2 = datetime.now()
    dty = (t2 - t1).total_seconds()
    print('Loading %s rows from a yaml file: %s' % (NB_ROW, dty))

with open('perf_json_yaml.json') as fh:
    t1 = datetime.now()
    data = json.load(fh)
    t2 = datetime.now()
    dtj = (t2 - t1).total_seconds()
    print('Loading %s rows from a json file: %s' % (NB_ROW, dtj))

print('json is %dx faster for loading' % (dty / dtj))

And the result is:

Is yaml using libyaml? yes
Dumping 1024 rows into a yaml file: 0.251139
Dumping 1024 rows into a json file: 0.007725
json is 32x faster for dumping
Loading 1024 rows from a yaml file: 0.401224
Loading 1024 rows from a json file: 0.001793
json is 223x faster for loading

I am using PyYAML 3.11 with the libyaml C library on Ubuntu 12.04. I know that JSON is much simpler than YAML, but with a 223x ratio between JSON and YAML I am wondering whether my configuration is correct or not.

Do you get the same speed ratio?
How can I speed up yaml.load()?

– Eric (edited by oz123)

5 Answers


You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

What's happening is that Python's json library encodes Python's built-in datatypes directly into text chunks, replacing ' with " and deleting , here and there (to oversimplify a bit).

On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.

The same kind of thing has to happen in reverse when loading.

The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you're willing to write your own single-purpose sort-of-YAML parser, taking the following comment into consideration:

YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know no object is repeated and only basic types appear, you can use a JSON serialiser; it will still be valid YAML.
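The quoted comment can be demonstrated with a quick sketch (the data here is made up for illustration): a document dumped with `json` can be read straight back with a YAML loader, because JSON's flow-style syntax is also valid YAML for plain data:

```python
import json
import yaml

# Plain data: no repeated object references, only basic types.
data = {"rows": [1, 2, 3], "name": "perf test"}

# json.dumps produces flow-style collections, which a YAML parser
# also accepts, so the JSON text loads directly as YAML.
text = json.dumps(data)
assert yaml.safe_load(text) == data
```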

-- UPDATE

What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml module uses the pure-Python parser; you have to tell it to use PyYAML's C parser.

You can do it this way:

import yaml
from yaml import CLoader as Loader, CDumper as Dumper

with open('perf_json_yaml.yaml', 'w') as fh:
    yaml.dump(dummy_data, fh, Dumper=Dumper, default_flow_style=False)

with open('perf_json_yaml.yaml') as fh:
    data = yaml.load(fh, Loader=Loader)

In order to do so, you need the libyaml C library installed, for instance with apt-get:

$ apt-get install libyaml-dev

And PyYAML built with LibYAML support as well. But that's already the case based on your output.

I can't test it right now because I'm running OS X and brew has some trouble installing libyaml, but if you follow the PyYAML documentation, it is pretty clear that performance will be much better.
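To see the difference on your own machine, a rough timing sketch (not from the answer; the document and repetition counts are arbitrary) that compares the pure-Python loader against the C loader, falling back gracefully when PyYAML was built without libyaml:

```python
import timeit

import yaml

doc = yaml.safe_dump([{"key_%d" % i: i} for i in range(500)])

# CSafeLoader only exists when PyYAML was compiled against libyaml.
c_loader = getattr(yaml, "CSafeLoader", None)

t_py = timeit.timeit(lambda: yaml.load(doc, Loader=yaml.SafeLoader), number=3)
print("pure-Python SafeLoader: %.4fs" % t_py)
if c_loader is not None:
    t_c = timeit.timeit(lambda: yaml.load(doc, Loader=c_loader), number=3)
    print("C SafeLoader:           %.4fs" % t_c)
```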

– Jivan (edited by Shourya Bansal)
  • loading is still 12x slower with yaml. My sample is a list of 600,000 empty dictionaries. YAML doesn't need to do anything extra except slightly cleverer syntax analysis, which should take almost no extra time. – codeshot May 31 '15 at 08:52
  • On Mac: brew install yaml-cpp libyaml – Hans Nelsen Nov 15 '16 at 22:40
  • Jivan, you're a bloody legend. I was going to rewrite some Python code in C++ to speed things up. My 6 MB yaml file took 53 seconds to load using the standard yaml loader, and only 3 seconds with CLoader. – nevelis Dec 27 '16 at 00:46
  • I am not sure why you are saying that the CLoader speedup is only of interest if you are running under Linux; I just tried this under Windows and it works, giving me a huge speedup. – Mike Nakis Jan 23 '17 at 14:16
  • The comment you link to is incorrect. PyYAML doesn't build a graph. There are no connections between the `Node`s that the representer emits, not even in the case of a single object occurring multiple times in a data structure. – Anthon Sep 01 '18 at 08:54
  • If you get `cannot import name 'CLoader' from 'yaml'`, try installing `libyaml-dev` and then reinstalling pyyaml: `pip --no-cache-dir install --verbose --force-reinstall -I pyyaml` https://github.com/yaml/pyyaml/issues/108 – Niels-Ole Nov 06 '18 at 15:26
  • But make sure you're using the safe loaders: `from yaml import CSafeLoader as Loader, CSafeDumper as Dumper` – Ezra Steinmetz Mar 03 '22 at 12:53
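The import failure mentioned in the comments can be handled defensively. A common pattern (a sketch, not from the answer) is to try the C loader and fall back to the pure-Python one:

```python
import yaml

try:
    # Available only when PyYAML was built against libyaml.
    from yaml import CSafeLoader as FastSafeLoader
except ImportError:
    from yaml import SafeLoader as FastSafeLoader

data = yaml.load("a: 1\nb: [2, 3]\n", Loader=FastSafeLoader)
print(data)  # {'a': 1, 'b': [2, 3]}
```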

For reference, I compared a couple of human-readable formats, and indeed Python's yaml reader is by far the slowest. (Note the log scaling in the plot below.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:

[Plot: load time vs. data size for the readers benchmarked below; log scale]


Code to reproduce the plot:

import json
import tomllib

import numpy
import orjson
import pandas
import perfplot
import toml
import tomli
import ujson
import yaml
from yaml import CLoader, Loader


def setup(n):
    numpy.random.seed(0)
    data = numpy.random.rand(n, 3)

    with open("out.yml", "w") as f:
        yaml.dump(data.tolist(), f)

    with open("out.json", "w") as f:
        json.dump(data.tolist(), f, indent=4)

    with open("out.dat", "w") as f:
        numpy.savetxt(f, data)

    with open("out.toml", "w") as f:
        toml.dump({"data": data.tolist()}, f)


def yaml_python(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=Loader)
    return out


def yaml_c(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=CLoader)
    return out


# def zaml_load(arr):
#     with open("out.yml", "r") as f:
#         out = zaml.load(f)
#     return out["data"]


def json_load(arr):
    with open("out.json", "r") as f:
        out = json.load(f)
    return out


def ujson_load(arr):
    with open("out.json", "r") as f:
        out = ujson.load(f)
    return out


def orjson_load(arr):
    with open("out.json", "rb") as f:
        out = orjson.loads(f.read())
    return out


def loadtxt(arr):
    with open("out.dat", "r") as f:
        out = numpy.loadtxt(f)
    return out


def pandas_read(arr):
    out = pandas.read_csv("out.dat", header=None, sep=" ")
    return out.values


def toml_load(arr):
    with open("out.toml", "r") as f:
        out = toml.load(f)
    return out["data"]


def tomli_load(arr):
    with open("out.toml", "rb") as f:
        out = tomli.load(f)
    return out["data"]


def tomllib_load(arr):
    # tomllib requires a binary-mode file object.
    with open("out.toml", "rb") as f:
        out = tomllib.load(f)
    return out["data"]


b = perfplot.bench(
    setup=setup,
    kernels=[
        yaml_python,
        yaml_c,
        json_load,
        loadtxt,
        pandas_read,
        toml_load,
        tomli_load,
        tomllib_load,
        ujson_load,
        orjson_load,
    ],
    n_range=[2**k for k in range(18)],
)

b.save("out.png")
b.show()
– Nico Schlömer

Yes.

Other answers here have said "use CLoader", which is a great tip, but if you're not using any custom classes (!!foo tags in your YAML), you can squeeze out another ~20% or so by using CBaseLoader instead of plain CLoader.

I had a script that went from ~2min37sec to ~2min7sec with this change.

Should be as easy as this:

import yaml

with open(...) as f:
    data = yaml.load(f, Loader=yaml.CBaseLoader)
– Vegard

Not yet mentioned here: there is also a CSafeLoader C class implementation available.

import yaml
from yaml import CSafeLoader

with open(file_path, 'r', encoding="utf-8") as config_file:
    config_data = yaml.load(config_file, Loader=CSafeLoader)

I found it to be fractionally faster than CLoader (and both were around 15 times faster than the pure-Python implementation, at least for small files under 50 kB).

As noted above, CBaseLoader is around 20% faster than the other C classes. This is because:

BaseLoader(stream) does not resolve or support any tags and constructs only basic Python objects: lists, dictionaries and Unicode strings.

Note, however, that CBaseLoader does not recognise booleans in YAML; they come back as strings.
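The tag-resolution difference is easy to demonstrate (a small sketch with made-up keys): BaseLoader leaves every scalar as a string, while SafeLoader resolves the implicit bool and int tags:

```python
import yaml

doc = "enabled: true\ncount: 3\n"

# BaseLoader skips tag resolution entirely: every scalar stays a str.
base = yaml.load(doc, Loader=yaml.BaseLoader)
# SafeLoader resolves 'true' to a bool and '3' to an int.
safe = yaml.load(doc, Loader=yaml.SafeLoader)

print(base)  # {'enabled': 'true', 'count': '3'}
print(safe)  # {'enabled': True, 'count': 3}
```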

– Stuart Cardall

Yes, I also noticed that JSON is way faster, so a reasonable approach would be to convert YAML to JSON first. If you don't mind Ruby, you can get a big speedup and ditch the yaml install altogether:

import commands, json
def load_yaml_file(fn):
    ruby = "puts YAML.load_file('%s').to_json" % fn
    j = commands.getstatusoutput('ruby -ryaml -rjson -e "%s"' % ruby)
    return json.loads(j[1])
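Note that the `commands` module is Python 2 only and was removed in Python 3. A rough modern equivalent using `subprocess` (a sketch assuming a `ruby` interpreter with its yaml/json stdlibs is on PATH) might look like:

```python
import json
import subprocess


def load_yaml_file(fn):
    # Ask Ruby's YAML (the Psych/libyaml wrapper) to re-emit the file
    # as JSON, then parse that JSON on the Python side.
    ruby = "puts YAML.load_file(ARGV[0]).to_json"
    result = subprocess.run(
        ["ruby", "-ryaml", "-rjson", "-e", ruby, fn],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```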

Here is a comparison for 100K records:

load_yaml_file: 0.95 s
yaml.load: 7.53 s

And for 1M records:

load_yaml_file: 11.55 s
yaml.load: 77.08 s

If you insist on using yaml.load anyway, remember to put it in a virtualenv to avoid conflicts with other software.

– personal_cloud
  • I don't mind ruby, but I do mind bogus answers. 1) You're not really using ruby: in your code you are using a [thin layer around the libyaml C library](https://ruby-doc.org/stdlib-2.3.0/libdoc/yaml/rdoc/YAML.html): "The underlying implementation is the libyaml wrapper Psych". 2) You compare that with PyYAML without the libyaml C library; if you had used it, you would see that Python wrapping libyaml is not 7 times slower but only a few percent. 3) The deprecation of the `commands` module was announced in PEP 0361 in 2006, yet you still propose to use it more than **eleven** years later. – Anthon Aug 25 '18 at 12:20