
I am trying to retrieve the names of the people from my file. The file size is 201 GB.

import json

# read the file one line at a time; each line holds one JSON record
with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])

Whenever I try to run this program on Windows, I get a MemoryError saying there is insufficient memory.

Is there a reliable way to parse a single key-value pair without loading the whole file? I have reading the file in chunks in mind, but I don't know where to start.

Here is a sample: test.json

Every record is on its own line, separated by a newline. Hope this helps.

Jaffer Wilson
  • Why not drop `readlines`? `for l in fh: ...` – Moses Koledoye Apr 07 '17 at 13:57
  • Don't use readlines(). – polku Apr 07 '17 at 13:57
  • Well that depends on the structure of your file. You know that `open` gives you an iterator over lines, right? So the line where memory blows up is the one with `readlines`. Since your code indicates that the file holds JSON data, could you even make sense of individual chunks? Lastly: 201GB, holy shit. – timgeb Apr 07 '17 at 13:58
  • @timgeb Yes, what can I do? I have my client data in the file and need to check the names – Jaffer Wilson Apr 07 '17 at 13:59
  • Well, I have removed the `readlines` but still got the same memory error – Jaffer Wilson Apr 07 '17 at 14:00
  • @JafferWilson now memory blows up at `json.loads`. What did you expect? :) – timgeb Apr 07 '17 at 14:01
  • @timgeb What more do you need? Please let me know and I will add it. But please do not say you need the 201 GB for testing .. that is quite impossible for me... :P – Jaffer Wilson Apr 07 '17 at 14:01
  • I think I understand correctly now. First of all I would change the title to "how can I reliably access a single key-value pair from a JSON file that's too large to load into memory?" – timgeb Apr 07 '17 at 14:02
  • @timgeb so is there no mechanism that could be built into the program that would help me.... – Jaffer Wilson Apr 07 '17 at 14:02
  • @timgeb Maybe you can change it. But is there any mechanism that could take small split chunks of lines, process them, and write them to another single file? – Jaffer Wilson Apr 07 '17 at 14:04
  • Sure there is, but the problem here is that you basically have a repr of a dictionary which would be tricky to parse in chunks. I don't know, I find the question interesting. – timgeb Apr 07 '17 at 14:06
  • @JafferWilson Is the format of your json file "single record per line"? Each line containing a json record? – Himaprasoon Apr 07 '17 at 14:06
  • @Himaprasoon Yes, it is for sure... – Jaffer Wilson Apr 07 '17 at 14:07
  • Problem is if it's a json file, how repeatable is it? Is it 201 GB of small 10k chunks of the same data or is it one huge massive chunk of data wrapped in {}? If it's all repeatable you might be able to chunk it and pass the chunks into a generator but it all depends on your data format. hoooo boy. – Keef Baker Apr 07 '17 at 14:07
  • @KeefBaker Believe me... There is no repetition of data in the file, except 10 lines.. I suppose.. but no more than that for sure... – Jaffer Wilson Apr 07 '17 at 14:08
  • @JafferWilson can you show few lines from the file. (Just to verify its single record per line ) – Himaprasoon Apr 07 '17 at 14:09
  • @timgeb Thank you for showing your interest in my question, but is there any solution in your mind... I will be grateful... :) – Jaffer Wilson Apr 07 '17 at 14:09
  • If it's one line this might help... http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory – you could put that in a generator and use yield for each piece maybe (see the sketch after these comments) – Keef Baker Apr 07 '17 at 14:10
  • @JafferWilson not at the moment. As other people have pointed out as well, we need more info about the structure of your file. Maybe we can divide it into logical chunks, maybe not. – timgeb Apr 07 '17 at 14:11
  • @timgeb OK, I will add a few lines to this question, but not the complete 201 GB.. it is damn hard for me to share... :P – Jaffer Wilson Apr 07 '17 at 14:17
  • @JafferWilson sure, the content is not important, the structure is. We don't need the 201 GB (please). – timgeb Apr 07 '17 at 14:17
  • @timgeb :)... sure, just adding.. – Jaffer Wilson Apr 07 '17 at 14:18
  • Please check the edited question... :) – Jaffer Wilson Apr 07 '17 at 14:26
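
Here is a minimal sketch of the generator idea from the comments, assuming one JSON record per line as the question describes; the helper name iter_records is made up for illustration.

import json

def iter_records(path):
    # yield one parsed record at a time so only a single line is in memory
    with open(path, "r") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

for record in iter_records("D:/dns.json"):
    print(record["name"])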

2 Answers


You may want to give ijson a try: https://pypi.python.org/pypi/ijson
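
A minimal sketch of how ijson could be used, assuming the file is one large JSON document (for example a top-level array of objects) rather than one record per line; the key "name" is taken from the question.

import ijson

# ijson streams parse events instead of loading the whole document into memory
with open("D:/dns.json", "rb") as fh:
    # "item" addresses each element of a top-level JSON array
    for record in ijson.items(fh, "item"):
        print(record["name"])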

bruno desthuilliers

Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.
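
(For what it's worth, the ijson library mentioned in the other answer does offer an incremental, event-driven interface along these lines. A minimal sketch, reusing the path and key from the question:)

import ijson

# ijson.parse yields (prefix, event, value) tuples, much like a SAX parser,
# so only a small part of the document is in memory at any time
with open("D:/dns.json", "rb") as fh:
    for prefix, event, value in ijson.parse(fh):
        # match a "name" key wherever it appears in the structure
        if event == "string" and (prefix == "name" or prefix.endswith(".name")):
            print(value)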

holdenweb