
I am trying to retrieve the names of the people from my file. The file size is 201 GB.

import json

# read the file one line at a time; each line holds one JSON record
with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])

Whenever I try to run this program on Windows, I get a MemoryError saying there is insufficient memory.

Is there a reliable way to parse a single key-value pair without loading the whole file? I have reading the file in chunks in mind, but I don't know where to start.

Here is a sample: test.json

Every record is on its own line, separated by a newline. Hope this helps.

Jaffer Wilson
  • Why not drop `readlines`? `for l in fh: ...` – Moses Koledoye Apr 07 '17 at 13:57
  • Don't use readlines(). – polku Apr 07 '17 at 13:57
  • Well that depends on the structure of your file. You know that `open` gives you an iterator over lines, right? So the line where memory blows up is the one with `readlines`. Since your code indicates that the file holds JSON data, could you even make sense of individual chunks? Lastly: 201GB, holy shit. – timgeb Apr 07 '17 at 13:58
  • @timgeb Yes, what can I do? I have my client data in the file and need to check the names – Jaffer Wilson Apr 07 '17 at 13:59
  • Well, I have removed the `readlines` but still got the same memory error – Jaffer Wilson Apr 07 '17 at 14:00
  • @JafferWilson now memory blows up at `json.loads`. What did you expect? :) – timgeb Apr 07 '17 at 14:01
  • @timgeb What more do you need? Please let me know and I will add it. But please do not say you need the 201 GB for testing .. that is quite impossible for me... :P – Jaffer Wilson Apr 07 '17 at 14:01
  • I think I understand correctly now. First of all I would change the title to "how can I reliably access a single key-value pair from a JSON file that's too large to load into memory?" – timgeb Apr 07 '17 at 14:02
  • @timgeb so is there no mechanism that could be built into the program that would help me.... – Jaffer Wilson Apr 07 '17 at 14:02
  • @timgeb Maybe you can change it. But is there any mechanism that could take small split chunks of lines, process them, and write them to another single file? – Jaffer Wilson Apr 07 '17 at 14:04
  • Sure there is, but the problem here is that you basically have a repr of a dictionary which would be tricky to parse in chunks. I don't know, I find the question interesting. – timgeb Apr 07 '17 at 14:06
  • @JafferWilson Is the format of your json file "single record per line"? Each line containing a json record? – Himaprasoon Apr 07 '17 at 14:06
  • @Himaprasoon Yes, it is for sure... – Jaffer Wilson Apr 07 '17 at 14:07
  • Problem is if it's a json file, how repeatable is it? Is it 201 GB of small 10k chunks of the same data or is it one huge massive chunk of data wrapped in {}? If it's all repeatable you might be able to chunk it and pass the chunks into a generator but it all depends on your data format. hoooo boy. – Keef Baker Apr 07 '17 at 14:07
  • @KeefBaker Believe me... There is no repetition of data in the file, except 10 lines.. I suppose.. but no more than that for sure... – Jaffer Wilson Apr 07 '17 at 14:08
  • @JafferWilson can you show few lines from the file. (Just to verify its single record per line ) – Himaprasoon Apr 07 '17 at 14:09
  • @timgeb Thank you for showing your interest in my question, but is there any solution in your mind... I will be grateful... :) – Jaffer Wilson Apr 07 '17 at 14:09
  • If it's one line this might help... http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory – you could put that in a generator and use yield for each piece maybe (see the sketch after these comments) – Keef Baker Apr 07 '17 at 14:10
  • @JafferWilson not at the moment. As other people have pointed out as well, we need more info about the structure of your file. Maybe we can divide it into logical chunks, maybe not. – timgeb Apr 07 '17 at 14:11
  • @timgeb OK, I will add a few lines to this question, but not the complete 201 GB.. it is damn hard for me to share... :P – Jaffer Wilson Apr 07 '17 at 14:17
  • @JafferWilson sure, the content is not important, the structure is. We don't need the 201 GB (please). – timgeb Apr 07 '17 at 14:17
  • @timgeb :)... sure, just adding.. – Jaffer Wilson Apr 07 '17 at 14:18
  • Please check the edited question... :) – Jaffer Wilson Apr 07 '17 at 14:26
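
Here is a minimal sketch of the generator idea from the comments, assuming one JSON record per line as the question describes; the helper name iter_records is made up for illustration.

import json

def iter_records(path):
    # yield one parsed record at a time so only a single line is in memory
    with open(path, "r") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

for record in iter_records("D:/dns.json"):
    print(record["name"])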

2 Answers


You may want to give ijson a try: https://pypi.python.org/pypi/ijson
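
A minimal sketch of how ijson could be used, assuming the file is one large JSON document (for example a top-level array of objects) rather than one record per line; the key "name" is taken from the question.

import ijson

# ijson streams parse events instead of loading the whole document into memory
with open("D:/dns.json", "rb") as fh:
    # "item" addresses each element of a top-level JSON array
    for record in ijson.items(fh, "item"):
        print(record["name"])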

bruno desthuilliers

Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.
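
(For what it's worth, the ijson library mentioned in the other answer does offer an incremental, event-driven interface along these lines. A minimal sketch, reusing the path and key from the question:)

import ijson

# ijson.parse yields (prefix, event, value) tuples, much like a SAX parser,
# so only a small part of the document is in memory at any time
with open("D:/dns.json", "rb") as fh:
    for prefix, event, value in ijson.parse(fh):
        # match a "name" key wherever it appears in the structure
        if event == "string" and (prefix == "name" or prefix.endswith(".name")):
            print(value)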

holdenweb