
I have a huge JSON file (~8 GB) and I want to read it as a stream, in chunks of 1000 examples at a time. I searched a lot and tried several packages, but none of them really did the job.

The format of my file is as follows:

{
    "Elem1": [
        {
            "orgs": []
        },
        {
            "people": []
        }
    ],
    "Elem2": [
        {
            "orgs": []
        },
        {
            "people": []
        }
    ],
    ...
}

As you can see, each element is saved as an array of two dicts with recurring keys in them. Is there a way to read/load/process the file above in chunks of elements, i.e. chunk_1 = [ Elem1, Elem2, ... ], into RAM and get the values for the keys out of them? Any ideas how to do that? Would appreciate your help.

Best regards Chris

  • Check if this helps : https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python – Prakhar Londhe Feb 06 '21 at 13:46
  • I tried it already, with the function. As far as I can tell, it returns a generator, not chunks. So when I did `data = stream_read_json("my_json_file.json")` and then `next(data)`, it still returned all elements at once. Or how could I change that behavior? – ChrisDelClea Feb 06 '21 at 13:54
  • The `json` standard module loads a json object as a whole. Full stop. If you want a different behaviour, you should implement a parser *by hand*. As you know the overall structure of your data, you may implement only what is required and forget all the corner cases. – Serge Ballesta Feb 06 '21 at 14:03
  • What do you mean by implementing a parser by hand? How would that parser look? – ChrisDelClea Feb 06 '21 at 14:07
  • 1
    Custom parser.. go character by character, push the char in a stack at each '{' and '[' and pop at '}' and ']'.. .Every time you reach empty stack increment a counter by one.. do this until you reach desired count – Prakhar Londhe Feb 06 '21 at 14:50
  • I checked again and saw it is possible to do `yield from (i:i+n)`; however, in my case I have a dictionary. Is there a way to work around this problem, so that when I do `[elem for elem in data]` it returns a chunk in the range 1-n? But how would I slice a dict? – ChrisDelClea Feb 06 '21 at 15:51
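The dict-slicing idea raised in the last comment can be sketched with `itertools.islice`, which lazily takes n items at a time from a single iterator. This is a minimal sketch; `chunk_dict` is a hypothetical helper name, not part of any library:

```python
from itertools import islice

def chunk_dict(d, n):
    """Yield successive dicts containing n key/value pairs each."""
    it = iter(d.items())            # one shared iterator across all chunks
    while True:
        chunk = dict(islice(it, n)) # take up to n pairs from where we left off
        if not chunk:               # iterator exhausted
            break
        yield chunk

data = {"Elem1": [], "Elem2": [], "Elem3": []}
chunks = list(chunk_dict(data, 2))
# first chunk holds "Elem1" and "Elem2", second holds the remaining "Elem3"
```

Note this only helps once the data is already a dict in memory; for an 8 GB file you still need the streaming parser discussed below the comments.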

1 Answer


As Serge said, you will need a custom parser to do the job. Something like the one below:

stack = []
json_string = ""
count = 0
with open(filename) as f:
  while True:
    c = f.read(1)
    if not c:                  # end of file
      break
    if c == '{' or c == '[':
      stack.append(c)
    elif c == '}' or c == ']':
      stack.pop()
    json_string += c
    # a ']' that drops us back to depth 1 closes one top-level element
    if c == ']' and len(stack) == 1:
      count += 1
      if count == DESIRED_COUNT:
        json_string += '}'     # close the outer object by hand
        break

The final `json_string` will contain a valid JSON object with `DESIRED_COUNT` elements, which you can hand to `json.loads`. (Note: this simple scanner assumes no `{`, `}`, `[`, or `]` characters occur inside string values; handling those would require tracking whether you are inside a quoted string.)
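A minimal usage sketch, assuming the loop above produced a `json_string` like the one hard-coded below (here with `DESIRED_COUNT == 1`):

```python
import json

# Hypothetical output of the character-by-character loop above:
json_string = '{"Elem1": [{"orgs": []}, {"people": []}]}'

data = json.loads(json_string)        # parse the reassembled chunk
for name, parts in data.items():      # name -> "Elem1", parts -> list of dicts
    for part in parts:
        for key, values in part.items():
            print(name, key, values)
```

Repeating the loop (without rewinding the file, and skipping the separating comma) then yields the next chunk of elements.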

Prakhar Londhe