9

I am looking to implement a streaming JSON parser for a very, very large JSON file (~1TB) that I'm unable to load into memory. One option is to use something like https://github.com/stedolan/jq to convert the file into newline-delimited JSON, but there are various other things I need to do to each JSON object, which makes this approach not ideal.

Given a very large JSON object, how would I be able to parse it object by object, similar to this approach in XML: https://www.ibm.com/developerworks/library/x-hiperfparse/index.html.

For example, in pseudocode:

with open('file.json', 'r') as f:
    json_str = ''
    for line in f:  # what if there are no newlines in the json obj?
        json_str += line
        if is_valid(json_str):
            obj = json.loads(json_str)
            do_something(obj)
            json_str = ''
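
To make the idea concrete, here is a rough sketch of the same approach that does not rely on newlines at all, using json.JSONDecoder.raw_decode to pull one complete value off the front of a read buffer (the chunk size is arbitrary and do_something is a placeholder). Note that this only helps if the file is a stream of concatenated top-level objects; if the whole file is one giant array or object, it would still end up buffering everything:

import json

def stream_objects(path, chunk_size=1 << 20):
    # Yield complete top-level JSON values one at a time, without
    # assuming they are newline-delimited.
    decoder = json.JSONDecoder()
    buf = ''
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                buf = buf.lstrip()
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    # buffer ends mid-value; read another chunk
                    break
                yield obj
                buf = buf[end:]

for obj in stream_objects('file.json'):
    do_something(obj)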

Additionally, I did not find jq -c to be particularly fast (ignoring memory considerations). For example, doing json.loads was just as fast (and a bit faster) than using jq -c. I tried using ujson as well, but kept getting a corruption error which I believe was related to the file size.

# file size is 2.2GB
>>> import json,time
>>> t0=time.time();_=json.loads(open('20190201_itunes.txt').read());print (time.time()-t0)
65.6147990227

$ time cat 20190206_itunes.txt|jq -c '.[]' > new.json
real    1m35.538s
user    1m25.109s
sys 0m15.205s

Finally, here is an example 100KB json input which can be used for testing: https://hastebin.com/ecahufonet.json

  • As far as I'm aware, there is no good story in Python here. The ideal approach would be to change the process which generates the 1TB JSON blob to use a more convenient format for streaming, such as [jsonlines](http://jsonlines.org) – wim Feb 06 '19 at 18:25
  • @wim these are user/client-generated files, so I have no control over them. –  Feb 06 '19 at 18:25
  • Why can't you try pd.read_json() with the chunksize option? – Nusrath Feb 06 '19 at 18:28
  • Can you use something like `jq` and preprocess the document so it's valid JSONLines? – yorodm Feb 06 '19 at 18:33
  • @Nusrath could you please clarify how you'd use the `chunksize` option approach? Is it able to work with the above input? –  Feb 07 '19 at 19:56
  • How consistent are the JSON objects? Viewing your data file, it appears to be a single object that encompasses the entire file. If that's the case, you may be able to write a rudimentary stream parser that splits the objects that compose the single gigantic outer object into newline-delimited JSON and parse it per @EilifMikkelsen's answer – Andrew Henle Feb 08 '19 at 21:32
  • You want to look at [`ijson`](https://pypi.org/project/ijson/), a streaming JSON parser. – Martijn Pieters Feb 08 '19 at 23:33
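
For reference, a minimal sketch of the ijson approach suggested in the comment above (this assumes the top level of the file is one big array; the 'item' prefix would need adjusting for a different structure, and do_something is the placeholder from the question):

import ijson  # third-party streaming JSON parser: pip install ijson

with open('file.json', 'rb') as f:
    # 'item' yields each element of a top-level array one at a time,
    # so only a single object is held in memory at once.
    for obj in ijson.items(f, 'item'):
        do_something(obj)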

2 Answers

-1

Consider converting this JSON into a filesystem tree (folders & files), so that every JSON object is converted to a folder that contains files:

  • name.txt - contains the name of the property in the parent folder (JSON object); the value of that property is the current folder (JSON object)
  • properties_000000001.txt
  • properties_000000002.txt

    ....

Every properties_X.txt file contains at most N (a limited number of) lines of the form property_name: property_value, for example:

  • "number_property": 100
  • "boolean_property": true
  • "object_property": folder(folder_0000001)
  • "array_property": folder(folder_000002)

folder_0000001, folder_000002 are the names of local folders (subfolders of the current folder).

Every array is converted to a folder with files:

  • name.txt
  • elements_0000000001.txt
  • elements_0000000002.txt

    ....
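
A rough sketch of how such a conversion could look (the cap N, the folder/file naming, and reading the input with json.load are all illustrative; for a 1TB input the values would have to come from a streaming parser instead):

import json
import os

N = 1000  # at most N lines per properties_/elements_ file (illustrative cap)

def dump_value(value, folder, name=None):
    # Write one JSON object or array as a folder, following the layout above.
    os.makedirs(folder, exist_ok=True)
    if name is not None:
        with open(os.path.join(folder, 'name.txt'), 'w') as f:
            f.write(name)
    if isinstance(value, dict):
        items, prefix = list(value.items()), 'properties'
    else:  # assume a list
        items, prefix = list(enumerate(value)), 'elements'
    lines, child = [], 0
    for key, val in items:
        if isinstance(val, (dict, list)):
            child += 1
            sub = 'folder_%07d' % child
            dump_value(val, os.path.join(folder, sub),
                       name=key if isinstance(value, dict) else None)
            entry = 'folder(%s)' % sub
        else:
            entry = json.dumps(val)
        if isinstance(value, dict):
            entry = '%s: %s' % (json.dumps(key), entry)
        lines.append(entry)
    for i in range(0, len(lines), N):
        part = os.path.join(folder, '%s_%09d.txt' % (prefix, i // N + 1))
        with open(part, 'w') as f:
            f.write('\n'.join(lines[i:i + N]) + '\n')

# Illustrative usage on a small input; a 1TB file would need streaming instead.
dump_value(json.load(open('small_input.json')), 'json_tree')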

jnr
  • Just so I'm clear, you're suggesting creating potentially billions and billions of files/folders just to store the JSON? That sounds a bit impractical... and it might even take longer just to clean up the files once things are done. Or am I missing something? –  Feb 07 '19 at 18:18
  • You said it's quite a large JSON (~1TB). You also mentioned converting the file into newline-delimited JSON, but could the entire JSON become one line after such a conversion? I proposed storing it as folders & files because you can keep a small view of the entire JSON in your program while storing the rest on disk. It's also not clear whether the depth of the JSON file is going to be limited (the depth of nestings {{{}}}), or what you plan to do with the parsed JSON, or during parsing/navigating through the JSON. – jnr Feb 08 '19 at 10:38
-2

If the file contains one large JSON object (either array or map), then per the JSON spec, you must read the entire object before you can access its components.

If, for instance, the file is an array of objects like [ {...}, {...} ], then newline-delimited JSON is far more efficient, since you only have to keep one object in memory at a time and the parser only has to read one line before it can begin processing.

If you need to keep track of some of the objects for later use during parsing, I suggest creating a dict to hold those specific records of running values as you iterate over the file.

Say you have JSON

{"timestamp": 1549480267882, "sensor_val": 1.6103881016325283}
{"timestamp": 1549480267883, "sensor_val": 9.281329310309406}
{"timestamp": 1549480267883, "sensor_val": 9.357327083443344}
{"timestamp": 1549480267883, "sensor_val": 6.297722749124474}
{"timestamp": 1549480267883, "sensor_val": 3.566667175421604}
{"timestamp": 1549480267883, "sensor_val": 3.4251473635178655}
{"timestamp": 1549480267884, "sensor_val": 7.487766674770563}
{"timestamp": 1549480267884, "sensor_val": 8.701853236245032}
{"timestamp": 1549480267884, "sensor_val": 1.4070662393018396}
{"timestamp": 1549480267884, "sensor_val": 3.6524325449499995}
{"timestamp": 1549480455646, "sensor_val": 6.244199614422415}
{"timestamp": 1549480455646, "sensor_val": 5.126780276231609}
{"timestamp": 1549480455646, "sensor_val": 9.413894020722314}
{"timestamp": 1549480455646, "sensor_val": 7.091154829208067}
{"timestamp": 1549480455647, "sensor_val": 8.806417239029447}
{"timestamp": 1549480455647, "sensor_val": 0.9789474417767674}
{"timestamp": 1549480455647, "sensor_val": 1.6466189633300243}

You can process this with

import json
from collections import deque

# RingBuffer from https://www.daniweb.com/programming/software-development/threads/42429/limit-size-of-a-list
class RingBuffer(deque):
    def __init__(self, size):
        deque.__init__(self)
        self.size = size

    def full_append(self, item):
        deque.append(self, item)
        # full, pop the oldest item, left most item
        self.popleft()

    def append(self, item):
        deque.append(self, item)
        # max size reached, append becomes full_append
        if len(self) == self.size:
            self.append = self.full_append

    def get(self):
        """returns a list of size items (newest items)"""
        return list(self)


def proc_data():
    # Declare some state management in memory to keep track of whatever you want
    # as you iterate through the objects
    metrics = {
        'latest_timestamp': 0,
        'last_3_samples': RingBuffer(3)
    }

    with open('test.json', 'r') as infile:        
        for line in infile:
            # Load each line
            line = json.loads(line)
            # Do stuff with your running metrics
            metrics['last_3_samples'].append(line['sensor_val'])
            if line['timestamp'] > metrics['latest_timestamp']:
                metrics['latest_timestamp'] = line['timestamp']

    return metrics

print(proc_data())
Eilif Mikkelsen
  • @Elif -- could you show an example of using the last method, with some test json? –  Feb 06 '19 at 19:04
  • @Elif -- thanks for the update, but this already assumes newline-delimited JSON input. Suppose it is a JSON object with zero newlines in it. That is the scenario I have. -- So this line would not work: `for line in infile: line = json.loads(line)` –  Feb 06 '19 at 19:21
  • You mentioned "One option is to use something like https://github.com/stedolan/jq to convert the file into json-newline-delimited, but there are various other things I need to do to each json object, that makes this approach not ideal." My proposed pattern using `metrics` presumes that you have already converted to newline delimited json using `cat a.json | jq -c '.[]'` . If the processing pattern works for your use case then you should use newline delimited JSON going forward. I apologize for the confusion. – Eilif Mikkelsen Feb 06 '19 at 19:23
  • sorry about the confusion. I've added a sample 100K file in the question for sample data to use against, if that's helpful. Thanks! –  Feb 06 '19 at 19:30
  • Thank you for adding the file. I don't understand how you would like me to change the example. Run `cat ./ecahufonet.json | jq -c '.[]' > ./ecahufonet_lines.json` and then follow the pattern in my example. The initial JQ operation will take some time however subsequent analysis will be blazing fast. – Eilif Mikkelsen Feb 06 '19 at 19:55
  • I mean to use it entirely without `jq`. jq doesn't work for what we need to do, so we need to start from the 'normal' JSON and get to newline-delimited JSON without using that utility. –  Feb 07 '19 at 19:55
  • I have to give -1 because assuming the streaming json is newline separated is a good way to introduce bugs. – Winny Jun 18 '20 at 23:19
  • Where does the JSON spec say that you must read the entire object before you can access its components? (@EilifMikkelsen, thanks for your answer, but I wonder whether that statement is correct. Is all JSON streaming processing not conformant with the JSON spec?) – DaveFar Jun 15 '21 at 10:00
  • I think the problem is going to be if you have, let's say, `a: { b: [ ginormous_array ], c: { else } }`. If you stream the JSON, technically the `a` object is incomplete. JavaScript would always assume an object is a hashtable and that 'c' should be accessible. But this isn't JavaScript. – Luiz Felipe May 11 '22 at 17:30
  • It doesn't say anything about an object being incomplete at some point during the parsing. Although it says that you shouldn't rely on the order of elements, which may require at least an entire single-pass scan to get any single object by path, like `a.c`. https://datatracker.ietf.org/doc/html/rfc7159#section-4 – Luiz Felipe May 11 '22 at 17:38
  • So if you are scanning for sibling objects of `b`, you have to parse the entire file, but if you are only interested in items inside `b`'s array, it's totally valid for `a` to be an incomplete object (and `b` an incomplete array, which you are streaming; that's the purpose). With that in mind, streaming JSON is totally valid as long as you accept that objects may be incomplete. – Luiz Felipe May 11 '22 at 17:40