
I am trying to load a large JSON file (around 4GB) as a pandas DataFrame, but the following method does not work for files larger than around 2GB. Is there any alternative method?

```python
import pandas as pd

data_dir = 'data.json'
my_data = pd.read_json(data_dir, lines=True)
```

I tried ijson but have no idea how to convert it to a DataFrame.

  • What's your RAM? Did you try the built-in `json.loads`? – Or Duan Jul 11 '17 at 07:45
  • Are you using 32-bit or 64-bit Python? – Jonas Adler Jul 11 '17 at 08:22
  • @JonasAdler I'm going to go ahead with the assumption that he's using 32-bit Python; the [~2GB limit](https://stackoverflow.com/a/639562/4022608) would be too much of a coincidence otherwise. – Baldrickk Jul 11 '17 at 08:51
  • To the comments above, I am using 64-bit Python with 8GB of RAM and I still had 55% free, so ideally it should work :). Anyway, thanks to your advice, it's working now with `json.loads` (see the sketch after these comments). – Howell Yu Jul 11 '17 at 09:01
  • Just because the file on disk is 4GB does not mean the representation in memory is 4GB. Python creates an object for every string, which might take more space than it does on disk. – Maarten Fabré Jul 11 '17 at 09:42
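
For reference, the line-by-line `json.loads` approach mentioned in the comments might look like the following. This is a minimal sketch; it assumes `data.json` is newline-delimited JSON (one object per line), as `lines=True` in the question implies:

```python
import json

import pandas as pd

records = []
with open('data.json', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))

my_data = pd.DataFrame(records)
```

Note that this still builds the full DataFrame in memory, so it only helps when the parsed records themselves fit in RAM.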

1 Answer


Loading the entire document into memory may not be the best approach in this case. JSON of this size may require a different approach to parsing. Try using a streaming parser instead. Some options:

https://pypi.org/project/json-stream-parser/

https://pypi.org/project/ijson/

The key is to not load the entire document into memory. This is similar to SAX parsing in the XML world.
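
As a rough, untested sketch of how `ijson` output can be turned into a DataFrame: stream one record at a time and flush the accumulated records into DataFrames in batches. The `'item'` prefix here assumes the top level of the document is a single JSON array; if the file is newline-delimited JSON instead (as `lines=True` in the question suggests), a plain line-by-line `json.loads` loop or pandas' own `chunksize` (below) is simpler:

```python
import ijson
import pandas as pd

frames = []
batch = []

with open('data.json', 'rb') as f:
    # ijson.items yields one parsed element of the top-level array
    # at a time, without loading the whole document into memory
    for record in ijson.items(f, 'item'):
        batch.append(record)
        if len(batch) >= 100000:  # flush every 100k records
            frames.append(pd.DataFrame(batch))
            batch = []

if batch:
    frames.append(pd.DataFrame(batch))

df = pd.concat(frames, ignore_index=True)
```

Concatenating all the batches still needs enough RAM for the final DataFrame; if even that is too large, process or write out each batch inside the loop instead.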

I am not a Python expert; however, there should be a good library that can already do this for you.
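
For newline-delimited JSON like the file in the question, recent versions of pandas can also do the streaming themselves: passing `chunksize` together with `lines=True` to `pd.read_json` returns an iterator of DataFrames instead of one huge frame. A minimal sketch (the chunk size of 100000 rows is an arbitrary choice):

```python
import pandas as pd

# chunksize only works together with lines=True (newline-delimited JSON);
# it yields DataFrames of at most 100000 rows instead of the whole file
for chunk in pd.read_json('data.json', lines=True, chunksize=100000):
    # process each chunk here (filter, aggregate, write out, ...)
    # instead of holding all 4GB in memory at once
    print(chunk.shape)
```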