python ijson not working on multiple element at once

Question

I have thousands of very large JSON files that I need to process on specific elements. To avoid memory overload I am using a python library called ijson which works fine when I am processing only a single element from the json file but when I try to process multiple-element at once it throughs

IncompleteJSONError: parse error: premature EOF

Partial JSON:

{
  "info": {
    "added": 1631536344.112968,
    "started": 1631537322.81162,
    "duration": 14,
    "ended": 1631537337.342377
  },
  "network": {
    "domains": [
      {
        "ip": "231.90.255.25",
        "domain": "dns.msfcsi.com"
      },
      {
        "ip": "12.23.25.44",
        "domain": "teo.microsoft.com"
      },
      {
        "ip": "87.101.90.42",
        "domain": "www.msf.com"
      }
    ]
  }
}

Working Code: (Multiple file open)

my_file_list = [f for f in glob.glob("data/jsons/*.json")]
final_result = []
for filename in my_file_list:
    row = {}
    with open(filename, 'r') as f:
        info = ijson.items(f, 'info')
        for o in info:
             row['added']= float(o.get('added'))
             row['started']= float(o.get('started'))
             row['duration']= o.get('duration')
             row['ended']= float(o.get('ended'))
    
    with open(filename, 'r') as f:
        domains = ijson.items(f, 'network.domains.item')
        domain_count = 0
        for domain in domains:
            domain_count+=1
        row['domain_count'] = domain_count

Failure Code: (Single file open)

my_file_list = [f for f in glob.glob("data/jsons/*.json")]
final_result = []
for filename in my_file_list:
    row = {}
    with open(filename, 'r') as f:
        info = ijson.items(f, 'info')
        for o in info:
             row['added']= float(o.get('added'))
             row['started']= float(o.get('started'))
             row['duration']= o.get('duration')
             row['ended']= float(o.get('ended'))
    
        domains = ijson.items(f, 'network.domains.item')
        domain_count = 0
        for domain in domains:
            domain_count+=1
        row['domain_count'] = domain_count

Not sure this is the reason Using python ijson to read a large json file with multiple json objects that ijson not able to work on multiple json element at once.

Also, let me know any other python package or any sample example that can handle large size json without memory issues.

I used https://pypi.org/project/json-stream/ for this task in my past. Worked really well. Might try it if you don't get help with ijson. nvm. I think this is a TextIO reading issue. Details in answers section. — Alex, Dec 04 '21 at 12:54

score 5 · Accepted Answer · answered Dec 04 '21 at 12:58

5

I think this is happening because you've finished reading your IO stream from the file, you're at the end already, and already asking for another query.

What you can do is to reset the cursor to the 0 position before the second query:

f.seek(0)

In a comment I said that you should try json-stream as well, but this is not an ijson or json-stream bug, it's a TextIO feature.

This is the equivalent of you opening the file a second time.

If you don't want to do this, then maybe you should look at iterating through every portion of the JSON, and then deciding for each object whether it has info or network.domains.item.

answered Dec 04 '21 at 12:58

Alex

14,338
5
41
59

the `f.seek(0)` seems working :) – A l w a y s S u n n y Dec 04 '21 at 13:03
btw, I have another query, can you suggest how can I get the element counts without iterating it on ijson, e.g to get `domain_count` i,e `3`, not willing to use a for loop to get only the element counts – A l w a y s S u n n y Dec 04 '21 at 13:04
Well, since you still have to parse and identify those elements, it appears that iterating over them is the only way to do it. You could think about loading the data in a more structured format. – Alex Dec 04 '21 at 13:56
please have a look : https://stackoverflow.com/questions/70290769/ijson-not-working-for-multiple-object-in-large-json-file – A l w a y s S u n n y Dec 09 '21 at 14:08

score 1 · Answer 2 · answered Dec 04 '21 at 13:21

1

While the answer above is correct, you can do better: if you know the structure of your JSON file and can rely on it, you can use this to your advantage and read the file only once.

ijson has an even interception mechanism, and the example there is very similar to what you want to achieve. In your case you want to get the info values, then iterate over the network.domains.item and count them. This should do:

row = {}
with open(filename, 'r') as f:
    parse_events = ijson.parse(f, use_float=True)
    for prefix, event, value in parse_events:
        if prefix == 'info.added':
            row['added'] = value
        elif prefix == 'info.started':
            row['started'] = value
        elif prefix == 'info.duration':
             row['duration'] = value
        elif prefix == 'info.ended':
             row['ended'] = value
        elif prefix == 'info' and event == 'end_map':
            break
    row['domain_count'] = sum(1 for _ in ijson.items(parse_events, 'network.domains.item'))

Note how:

ijson.items is fed with the result of ijson.parse.
use_float=True saves you from having to convert the values to float yourself.
The counting can be done by sum()-ing 1 for each item coming from ijson.items so you don't have to loop yourself manually.

answered Dec 04 '21 at 13:21

Rodrigo Tobar

569
4
13

thanks a lot for your answer sir. definitely try this one also. – A l w a y s S u n n y Dec 04 '21 at 13:24
just one query, what if I want to get the `value` and iterate it also because I have some nested array/object element that has to parse later in my json – A l w a y s S u n n y Dec 04 '21 at 13:25
the above solution works but multiple row sum doesn't work, only domain_count returns expected result other counts returns 0, but If I comment out domain_count then it counts perfectly – A l w a y s S u n n y Dec 04 '21 at 14:56
The code you shared is working fine but when I try to add another field count after `row['domain_count']` it always returns 0, e.g if I try `row['network_count']` it returns 0 but if comment out first one then it returns correct count. – A l w a y s S u n n y Dec 09 '21 at 01:05
please have a look : https://stackoverflow.com/questions/70290769/ijson-not-working-for-multiple-object-in-large-json-file – A l w a y s S u n n y Dec 09 '21 at 14:08

python ijson not working on multiple element at once

2 Answers2