5

I have an application which periodically dumps and loads a JSON file into Python using the standard JSON facilities.

Early on, we decided that it was a lot more convenient to work with the loaded JSON data as objects, rather than dictionaries. This really comes down to the convenience of "dot" member access, as opposed to [] notation for dictionary key lookup. One of the advantages of Javascript is that there is no real difference between dictionary lookup and member data access (which is why JSON is particularly suited to Javascript, I guess). But in Python, dictionary keys and object data members are different things.

So, our solution was to just use a custom JSON decoder which uses an object_hook function to return objects instead of dictionaries.
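For reference, a minimal sketch of that kind of hook (the names here are illustrative, not our actual code):

```python
import json


class JsonObject(object):
    """Illustrative object type built from each decoded JSON dict."""

    def __init__(self, d):
        self.__dict__.update(d)


def as_object(d):
    # Called once per decoded JSON object. This per-object Python-level
    # call is exactly the overhead described below.
    return JsonObject(d)


data = json.loads('{"user": {"name": "bob", "id": 7}}', object_hook=as_object)
print(data.user.name)  # bob
```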

And we lived happily ever after... until now, when this design decision may turn out to be a mistake. You see, now the JSON dump file has grown rather large (> 400 MB). As far as I know, the standard Python 3 JSON facilities use native code to do the actual parsing, so they are quite fast. But if you provide a custom object_hook, the interpreter still has to execute Python bytecode for every JSON object decoded - which SERIOUSLY slows things down. Without object_hook it takes only about 20 seconds to decode the whole 400 MB file. But with the hook, it takes over half an hour!

So, at this point two options come to mind, neither of which is very pleasant. One is to just forget about the convenience of "dot" member data access and use Python dictionaries. (This means changing significant amounts of code.) The other is to write a C extension module, use that as the object_hook, and see if we get any speedup.

I am wondering if there is some better solution I am not thinking of - perhaps an easier way to get "dot" member access while still initially decoding to a Python dictionary.

Any suggestions, solutions to this problem?

Channel72
  • 1
    Does it *have* to be JSON? You could consider the [`pickle` module](http://docs.python.org/py3k/library/pickle.html) instead; it lets you store and restore whole Python objects. [`shelve`](http://docs.python.org/py3k/library/shelve.html) and [`marshal`](http://docs.python.org/py3k/library/marshal.html) might also fit your needs. – Martijn Pieters Sep 13 '12 at 17:09
  • 2
    Before writing in C you could try [cython](http://cython.org/). – Bakuriu Sep 13 '12 at 17:33
  • try to profile the code and optimize? – unddoch Sep 13 '12 at 17:34
  • @Martijn Pieters, using JSON was an original requirement, so the data remains usable apart from the application. The Python-specific pickle format is unfortunately not an option. – Channel72 Sep 13 '12 at 17:39
  • @Channel72: Okay, a requirement is a requirement. – Martijn Pieters Sep 13 '12 at 17:40

3 Answers

3

Instead of using object_hook, you can let json return a dictionary, and then load that dictionary into a namedtuple.

Something like this:

from collections import namedtuple
result = json.loads(data)
JsonData = namedtuple("JsonData", result.keys())
jsondata = JsonData(**result)

I don't know what the speed of that would be, but it's worth a try.

Lennart Regebro
  • 1
    @pythonm Threading isn't going to make this faster, at least, not on CPython. You could have a look at: http://stackoverflow.com/questions/203912/does-python-support-multiprocessor-multicore-programming – Thomas Orozco Sep 13 '12 at 19:53
0

What about using the dict returned by the native JSON module and wrapping it in an object which provides dot access?

You could do something like:

class DictWrap(object):

    def __init__(self, d):
        self.__d = d

    def __getattr__(self, attr):
        try:
            return self.__d[attr]
        except KeyError:
            raise AttributeError(attr)


dw = DictWrap({"a": "foo", "b": "bar"})

print(dw.a, dw.b)  # foo bar
print(dw.c)        # AttributeError

Edit: Just saw Lennart Regebro's answer; I'd go for that.

nemo
0

It depends on the usage.

Lennart Regebro's solution would work perfectly for the plain-dictionary case (which is probably not true for your case). Otherwise you need to implement a recursive solution. But in that case Python will create a class for each dictionary inside your JSON.

nemo's solution is more lazy/on-demand, so if you are not going to use every field of your dictionary I would go with nemo's solution. But modify it to handle nested dictionaries and arrays:

def __getattr__(self, attr):
    ...
    if isinstance(self.__d[attr], dict):
        return DictWrap(self.__d[attr])
    elif isinstance(self.__d[attr], list):
        return ListWrap(self.__d[attr])  # and create a similar wrapper for lists
    ...

Another solution for the plain-dictionary case would be:

class JsonData(object):
    pass

jsondata = JsonData()
jsondata.__dict__.update(json.loads(data))
RomanI