
I have a huge JSON file (around 30 GB) that I need to work with. Decoding it with Python's json or cjson modules is too slow.

Is there any way I can either

a) split the file intelligently (not by line, but by JSON object), or b) decode a JSON file this huge very quickly?

Thanks

user1452494
  • Have you tried [ujson](https://pypi.python.org/pypi/ujson)? For me it was twice as fast as the regular `json` module from the stdlib. – Maciej Gol Aug 25 '14 at 07:40
  • If it's twice as fast, it would still take 5 hours by my calculation. – user1452494 Aug 25 '14 at 07:42
  • What's the time you expect the decoding to take? You can't expect a 30 GB file to be decoded in seconds. – Maciej Gol Aug 25 '14 at 07:44
  • There may not be that many other options - to split your file logically into "complete" JSON sub-objects you'd have to read (and process) the whole file first, which kind of defeats the purpose. It's not like a text file where you can just read the first `X` bytes, process that, then continue... – MattDMo Aug 25 '14 at 07:45
  • Obviously. The question is what is the BEST way, and that could include splitting the file, as I mentioned. – user1452494 Aug 25 '14 at 07:45
  • @MattDMo: assuming a "well-behaved" file, splitting first will be easier and faster than arbitrary parsing: in the best case, it's just matching the parentheses stack. (If the file is just one big object, there's nothing to be gained, of course.) – Ulrich Schwarz Aug 25 '14 at 08:01
  • possible duplicate of [Reading rather large json files in Python](http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python) – simonzack Aug 25 '14 at 10:04
  • possible duplicate of [Is there a memory efficient and fast way to load big json files in python?](http://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python) – Peter O. Aug 26 '14 at 04:26

2 Answers


If you don't know the structure of your JSON file, there is little you can do other than use a faster JSON decoder (e.g. ijson, which can do streaming, or ujson).
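For instance, here is a minimal sketch of streaming with ijson, assuming the top level of the file is a JSON array (the file name and the `handle` function are placeholders):

```python
# Minimal sketch, assuming data.json contains a JSON array at the top level.
# ijson streams the file instead of loading all 30 GB into memory at once.
import ijson

with open('data.json', 'rb') as f:
    # the 'item' prefix addresses each element of the top-level array
    for obj in ijson.items(f, 'item'):
        handle(obj)  # hypothetical per-element handler
```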

It may also be that, if you need to have all the data in Python's memory at the same time, the speed is limited by swapping from not having enough physical RAM; in that case adding more RAM may help (as obvious as it is, I think it is worth mentioning).

If you don't need a generic solution, check the structure of the file yourself and see how you can split it. E.g. if it is an array of whatever, it is probably easy to separate the array elements manually, as complex as they might be, and split them into chunks of any size (see the sketch below).
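As an illustration only, a rough sketch of that kind of manual splitting, assuming the file is a top-level array of objects; the function name is mine, and a real version would use buffered reads rather than one character at a time:

```python
# Rough sketch, assuming data.json is a top-level JSON array of objects.
# It tracks brace depth (ignoring braces that appear inside strings) to find
# where each element ends, yielding one raw JSON string per element so the
# elements can be decoded individually or written out in chunks.
def iter_array_elements(path):
    with open(path, 'r', encoding='utf-8') as f:
        buf = []
        depth = 0
        in_string = False
        escaped = False
        while True:
            ch = f.read(1)
            if not ch:
                break
            if depth == 0 and not buf:
                # skip the array's '[', commas and whitespace between elements
                if ch != '{':
                    continue
            buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    yield ''.join(buf)
                    buf = []
```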

P.S. You can always test what the lower bound is by just reading the 30 GB file as binary data and discarding it. If you are reading from the network, network speed may be the bottleneck; if you need to have all that data in memory, just create sample data of the same size, and it may take the same 5 hours due to swapping etc.
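A quick sketch of that read-speed test (the file name is assumed):

```python
# Sketch: measure how long it takes just to read data.json from disk,
# discarding the bytes, to get a lower bound for any decoding approach.
import time

start = time.time()
with open('data.json', 'rb') as f:
    while f.read(16 * 1024 * 1024):  # read 16 MB blocks and throw them away
        pass
print('raw read: %.1f s' % (time.time() - start))
```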

asm

Why not split the JSON into smaller pieces on the command line with `jq`, just like this:

$ mkdir -p parts   # the output directory must exist
$ cat data.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  while read line; do echo "$line" > parts/$(date +%s%N).json; done
Abdou Tahiri