
I have a huge JSON file (~8 GB) and I want to read it as a stream, in chunks of 1000 examples at a time. I searched a lot and tried several packages, but none of them really did the job.

The format of my file is as follows:

{
    "Elem1": [
        {
            "orgs": []
        },
        {
            "people": []
        }
    ],
    "Elem2": [
        {
            "orgs": []
        },
        {
            "people": []
        }
    ],
    ...
}

As you can see, each element is saved as an array of two dicts with recurring keys in them. Is there a way to read/load/process the file above in chunks of elements, i.e. chunk_1 = [ Elem1, Elem2, ... ], into RAM and get the values for the keys out of them? Any ideas how to do that? Would appreciate your help.

Best regards Chris

  • Check if this helps : https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python – Prakhar Londhe Feb 06 '21 at 13:46
  • I tried it already, with the function. As far as I can tell, it returns a generator, not chunks. So when I did `data = stream_read_json("my_json_file.json")` and then `next(data)`, it still returned all elements at once. Or how could I change that behavior? – ChrisDelClea Feb 06 '21 at 13:54
  • The `json` standard module loads a json object as a whole. Full stop. If you want a different behaviour, you should implement a parser *by hand*. As you know the overall structure of your data, you may implement only what is required and forget all the corner cases. – Serge Ballesta Feb 06 '21 at 14:03
  • What do you mean by implementing a parser by hand? How would that parser look? – ChrisDelClea Feb 06 '21 at 14:07
  • 1
    Custom parser.. go character by character, push the char in a stack at each '{' and '[' and pop at '}' and ']'.. .Every time you reach empty stack increment a counter by one.. do this until you reach desired count – Prakhar Londhe Feb 06 '21 at 14:50
  • I checked again and saw it is possible to do `yield from (i:i+n)`; however, in my case I have a dictionary. Is there a way to work around this problem, so that when I do `[elem for elem in data]` it returns a chunk in the range 1-n? But how would I slice a dict? – ChrisDelClea Feb 06 '21 at 15:51
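The dict-slicing idea raised in the last comment can be sketched with `itertools.islice`, which lazily takes n items at a time from a single iterator. This is a minimal sketch; `chunk_dict` is a hypothetical helper name, not part of any library:

```python
from itertools import islice

def chunk_dict(d, n):
    """Yield successive dicts containing n key/value pairs each."""
    it = iter(d.items())            # one shared iterator across all chunks
    while True:
        chunk = dict(islice(it, n)) # take up to n pairs from where we left off
        if not chunk:               # iterator exhausted
            break
        yield chunk

data = {"Elem1": [], "Elem2": [], "Elem3": []}
chunks = list(chunk_dict(data, 2))
# first chunk holds "Elem1" and "Elem2", second holds the remaining "Elem3"
```

Note this only helps once the data is already a dict in memory; for an 8 GB file you still need the streaming parser discussed below the comments.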

1 Answer


As Serge said, you will need a custom parser to do the job. Something like the one below:

stack = []
json_string = ""
count = 0
with open(filename) as f:
  while True:
    c = f.read(1)
    if not c:                  # end of file
      break
    if c == '{' or c == '[':
      stack.append(c)
    elif c == '}' or c == ']':
      stack.pop()
    json_string += c
    # a ']' that drops us back to depth 1 closes one top-level element
    if c == ']' and len(stack) == 1:
      count += 1
      if count == DESIRED_COUNT:
        json_string += '}'     # close the outer object by hand
        break

The final `json_string` will contain a valid JSON object with `DESIRED_COUNT` elements, which you can hand to `json.loads`. (Note: this simple scanner assumes no `{`, `}`, `[`, or `]` characters occur inside string values; handling those would require tracking whether you are inside a quoted string.)
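A minimal usage sketch, assuming the loop above produced a `json_string` like the one hard-coded below (here with `DESIRED_COUNT == 1`):

```python
import json

# Hypothetical output of the character-by-character loop above:
json_string = '{"Elem1": [{"orgs": []}, {"people": []}]}'

data = json.loads(json_string)        # parse the reassembled chunk
for name, parts in data.items():      # name -> "Elem1", parts -> list of dicts
    for part in parts:
        for key, values in part.items():
            print(name, key, values)
```

Repeating the loop (without rewinding the file, and skipping the separating comma) then yields the next chunk of elements.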

Prakhar Londhe