15

I'm trying to parse a large (~100 MB) JSON file using the ijson package, which lets me interact with the file in a memory-efficient way. However, after writing some code like this,

import ijson

with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == "name":
            print(value)

I found that the code parses only the first line and not the rest of the file!

Here is what a portion of my JSON file looks like:

{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}

I think ijson parses only one JSON object.

Can someone please suggest how to work around this?

mikek3332002
  • 3,546
  • 4
  • 37
  • 47
Boubouh Karim
  • 448
  • 1
  • 8
  • 21

2 Answers

13

Since the provided chunk looks like a set of lines, each composing an independent JSON object, it should be parsed accordingly:

# each JSON object is small, so there's no need for iterative parsing
import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        # data['name'], data['value'], and data['timestamp'] now
        # contain the corresponding values
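For the question's specific goal of printing each `name` field, the same line-by-line approach can filter directly on that key. A minimal sketch, with sample data inlined via `io.StringIO` as a stand-in for the real file:

```python
import io
import json

# Sample data standing in for the real file: one JSON object per line
sample = io.StringIO(
    '{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012}\n'
    '{"name":"engine_speed","value":772,"timestamp":1364323939.027}\n'
)

names = []
for line in sample:
    data = json.loads(line)  # parse one small object at a time
    names.append(data["name"])

print(names)  # → ['accelerator_pedal_position', 'engine_speed']
```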
user3159253
  • 16,836
  • 3
  • 30
  • 56
  • 1
  • Thanks for answering. I'm asking whether this will load the whole file into RAM? If it loads only one line at a time, then this is awesome – Boubouh Karim May 13 '16 at 03:13
  • Certainly `for line in f:` reads one line at a time. Check http://stackoverflow.com/questions/17246260/python-readlines-usage-and-efficient-practice-for-reading – user3159253 May 13 '16 at 03:24
  • How can I handle custom en- and decoding in ijson? I can do this rather easily with json and the cls= argument, how is it done in ijson? Any links? Thanks! – gilgamash Oct 28 '20 at 09:13
11

Unfortunately, the ijson library (v2.3 as of March 2018) does not handle parsing multiple top-level JSON objects. It can only handle one overall object, and if you attempt to parse a second object you will get an error: `ijson.common.JSONError: Additional data`. See bug reports here:

It's a big limitation. However, as long as each JSON object is followed by a line break (newline character), you can parse each one independently, line by line, like this:

import io
import ijson

with open(filename, encoding="UTF-8") as json_file:
    cursor = 0
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        for prefix, event, value in json_parser:
            print("prefix=", prefix, "event=", event, "value=", value)
        cursor += len(line)

You are still streaming the file rather than loading it entirely into memory, so this works on large JSON files. It also uses the line-streaming technique from How to jump to a particular line in a huge text file? and enumerate() from Accessing the index in 'for' loops?
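The `cursor` variable tracked above can later be fed to `file.seek()` to jump straight back to a given line without rescanning the file. A small sketch under that assumption, using a temporary file as a stand-in for the real data (the file is opened in binary mode so `len(line)` counts bytes, matching what `seek()` expects):

```python
import json
import tempfile

# Build a small sample file: one JSON object per line (stands in for the real data)
lines = [
    b'{"name":"engine_speed","value":772}\n',
    b'{"name":"vehicle_speed","value":0}\n',
]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(lines))
    path = tmp.name

# First pass: record the byte offset at which each line starts
offsets = []
with open(path, "rb") as f:
    cursor = 0
    for line in f:
        offsets.append(cursor)
        cursor += len(line)

# Later: jump straight to the second line without rereading the first
with open(path, "rb") as f:
    f.seek(offsets[1])
    data = json.loads(f.readline())

print(data["name"])  # → vehicle_speed
```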

Mr-IDE
  • 7,051
  • 1
  • 53
  • 59
  • Thanks @Mr-IDE. I was finally able to read something from my 5.5 GB dataset using ijson, and managed to extract some info from it such as dataID, status, values, location. Question: how do I read through all needed info at once, for instance "location"? – Azam Aug 27 '22 at 01:52