
I am parsing a YAML file with around 6500 lines with this format:

foo1:
  bar1:
    blah: { name: "john", age: 123 }
  metadata: { whatever1: "whatever", whatever2: "whatever" }
  stuff:
    thing1: 
      bluh1: { name: "Doe1", age: 123 }
      bluh2: { name: "Doe2", age: 123 }
    thing2:
    ...
    thingN:
foo2:
...
fooN:

I want to parse it with the PyYAML library (as far as I know, there are no other alternatives to it in Python; see "How can I parse a YAML file in Python").

Just for testing, I wrote this code to parse my file:

import yaml

config_file = "/path/to/file.yaml"

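with open(config_file, "r") as stream:
    # yaml.Loader was the default when this was written; recent PyYAML requires it explicitly
    sensors = yaml.load(stream, Loader=yaml.Loader)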

Running the script under the time command gives:

real    0m3.906s
user    0m3.672s
sys     0m0.100s

Those values don't seem very good. I then ran the same test with JSON, converting the same YAML file to JSON first:

import json

config_file = "/path/to/file.json"

with open(config_file, "r") as stream:
    sensors = json.load(stream)  # Read the JSON config file

But the execution time is far better:

real    0m0.058s
user    0m0.032s
sys     0m0.008s

What is the main reason that PyYAML takes so much longer to parse the YAML file than the json module takes to parse the JSON one? Is it a problem with PyYAML, or is the YAML format itself hard to parse? (Probably the former.)

EDIT:

I added another example with Ruby and YAML:

require 'yaml'

sensors = YAML.load_file('/path/to/file.yaml')

And the execution time is good! (or at least not as bad as the PyYAML example):

real    0m0.278s
user    0m0.240s
sys     0m0.032s
Pigueiras
  • Similar question => http://stackoverflow.com/questions/2451732/how-is-it-that-json-serialization-is-so-much-faster-than-yaml-serialization-in-p – moliware Aug 23 '13 at 14:19
  • @moliware Yes, I read that one before. But the question was about serialization, and the answers don't seem to answer my question :( – Pigueiras Aug 23 '13 at 14:35
  • After your edit I understand. Did you install it with the proper option: $ python setup.py --with-libyaml install – moliware Aug 23 '13 at 14:40
  • Yes, I followed the instructions here: http://rmcgibbo.github.io/blog/2013/05/23/faster-yaml-parsing-with-libyaml/ but LibYAML didn't improve anything. – Pigueiras Aug 23 '13 at 14:43
  • Can you post a link to the YAML and JSON data files you are using? I want to compare the two and see where the time is being spent. – Marwan Alsabbagh Aug 23 '13 at 17:24
  • I have to prepare a new one because the one that I am using contains sensitive data :( – Pigueiras Aug 23 '13 at 18:09
  • @MarwanAlsabbagh http://tny.cz/34e8b128 – Pigueiras Aug 23 '13 at 18:35

1 Answer

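According to the docs, you must explicitly use CLoader/CSafeLoader (and CDumper) to get the LibYAML-based parser; otherwise yaml.load uses the much slower pure-Python loader, even when LibYAML is installed: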

import yaml
try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader

config_file = "test.yaml"

stream = open(config_file, "r")
sensors = yaml.load(stream, Loader=Loader)

This gives me

real    0m0.503s

instead of

real    0m2.714s
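
Note that the time command also counts interpreter startup and imports. If you want to compare only the parsing step, here is a rough sketch that times each parser in-process. The helper names (best_time, yaml_parsers, make_sample) and the generated sample document are my own assumptions for illustration, not part of PyYAML, and the numbers will depend on your machine and data:

```python
import json
import time


def best_time(parse, text, repeat=3):
    """Return the fastest wall-clock time (in seconds) of parse(text) over `repeat` runs."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        parse(text)
        timings.append(time.perf_counter() - start)
    return min(timings)


def yaml_parsers():
    """Map a label to a parse function for each PyYAML loader that is available."""
    try:
        import yaml
    except ImportError:  # PyYAML is not installed, so there is nothing to compare
        return {}
    parsers = {
        "SafeLoader (pure Python)": lambda s: yaml.load(s, Loader=yaml.SafeLoader),
    }
    # CSafeLoader is only defined when PyYAML was built against LibYAML.
    if hasattr(yaml, "CSafeLoader"):
        parsers["CSafeLoader (LibYAML)"] = lambda s: yaml.load(s, Loader=yaml.CSafeLoader)
    return parsers


def make_sample(n=50):
    """Build a hypothetical nested document shaped roughly like the one in the question."""
    return {
        f"foo{i}": {
            "stuff": {f"thing{j}": {"name": f"Doe{j}", "age": 123} for j in range(5)}
        }
        for i in range(n)
    }


if __name__ == "__main__":
    sample = make_sample()
    print("json.loads:", best_time(json.loads, json.dumps(sample)))
    parsers = yaml_parsers()
    if parsers:
        import yaml

        yaml_text = yaml.safe_dump(sample)  # same data, serialized as YAML
        for label, parse in parsers.items():
            print(label + ":", best_time(parse, yaml_text))
```

If the CSafeLoader line does not show up in the output, PyYAML was installed without the LibYAML bindings, and switching loaders in your script will not help until that is fixed.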
chlunde