
I'm trying to parse a huge JSON file (approx. 14 GB) in Python for some data mining research I'm working on.

The problem is that when I use the built-in json module, it tries to load the full file into memory, until it runs out.

Of course, I could find a machine with enough RAM to hold the whole file (in fact I have one), but this is not a nice way of doing it.

What I have tried:

import json

with open('myfile.json', 'r') as file:
    loaded_json = json.load(file)
    # ...do stuff

What I would like is a way to use this file through the regular JSON interface of lists and dicts, but processing it directly from disk, or in chunks in memory.

Thanks!

0xfede7c8
    Can you provide the file? No, just kidding. Can you provide a small sample? If the file has more than one line, you could read it line-by-line. – masterfloda Apr 11 '18 at 21:25
  • You can easily implement your own parser that consumes the file on demand (consuming one token at a time). The JSON grammar is pretty simple and won't give you any problems. – Gabriel Apr 11 '18 at 21:30
  • From the duplicate tags I found ijson and json-streamer, which are supposedly libraries that do what I need. Thanks for the suggestions! But I do not want to re-invent the wheel, hehe. – 0xfede7c8 Apr 11 '18 at 21:40
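
For reference, ijson (mentioned in the comments) is built for exactly this kind of streaming. A minimal sketch, assuming the file is a top-level JSON array of objects and that ijson is installed (pip install ijson):

import ijson  # third-party streaming JSON parser: pip install ijson

# Iterate over records one at a time instead of loading all 14 GB at once.
# The 'item' prefix tells ijson to yield each element of a top-level JSON
# array as a regular Python dict, so memory use stays bounded by the size
# of a single record.
with open('myfile.json', 'rb') as file:
    for record in ijson.items(file, 'item'):
        # ...do stuff with one record
        print(record)

Because each record comes back as an ordinary dict, the rest of the processing code can keep using the usual lists-and-dicts interface.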

0 Answers