6

I am trying to parse a large JSON file (hundreds of gigabytes) to extract information from its keys. For simplicity, consider the following example:

import json
import random
import string

# Create a random key
def random_string(length):
    return "".join(random.choice(string.ascii_lowercase) for i in range(length))

# Create the dictionary
dummy = {random_string(10): random.sample(range(1, 1000), 10) for times in range(15)}

# Dump the dictionary into a JSON file
with open("dummy.json", "w") as fp:
    json.dump(dummy, fp)

Then I use ijson in Python 2.7 to parse the file:

file_name = "dummy.json"

with open(file_name, "r") as fp:

    for key in dummy.keys():

        print "key: ", key 

        parser = ijson.items(fp, str(key) + ".item")

        for number in parser:
            print number,

I was expecting to retrieve all the numbers in the lists corresponding to the keys of the dictionary. However, I got:

IncompleteJSONError: Incomplete JSON data

I am aware of this post: Using python ijson to read a large json file with multiple json objects, but in my case I have a single JSON file that is well formed, with a relatively simple schema. Any ideas on how I can parse it? Thank you.

Abdulrahman Bres
Paul

4 Answers

6

ijson has an iterator interface for large JSON files that lets you read the file lazily. You can process the file in small chunks and save the results somewhere else.

Calling ijson.parse() yields a stream of (prefix, event, value) tuples.

Some JSON:

{
    "europe": [
        {"name": "Paris", "type": "city"},
        {"name": "Rhein", "type": "river"}
    ]
}

Code:

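import ijson

FILE_PATH = 'europe.json'  # placeholder: path to the JSON file shown above

# ijson reads bytes, so open the file in binary mode and close it when done
with open(FILE_PATH, 'rb') as fp:
    for prefix, event, value in ijson.parse(fp):
        if event == 'string':
            print(value)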

Output:

Paris
city
Rhein
river

Reference: https://pypi.python.org/pypi/ijson

Abdulrahman Bres
  • The above example produces a dictionary for which the parser produces the error I described. This is not the same. – Paul Mar 07 '18 at 14:06
  • You cannot use ijson.items for a large file; it won't read the entire file and an error will be thrown – Abdulrahman Bres Mar 07 '18 at 17:23
  • For a large file, you need to work carefully with the generator returned by `ijson.items()` or `ijson.parse()`; e.g. you should avoid fetching all values at once with `set(your_generator)` or `list(your_generator)` – Ham May 24 '21 at 07:42
1

You are starting more than one parsing iteration with the same file object without resetting it. The first call to ijson works, but it moves the file object to the end of the file; the second time you pass the same object to ijson, it complains because there is nothing left to read from the file.

Try opening the file each time you call ijson; alternatively you can seek to the beginning of the file after calling ijson so the file object can read your file data again.

Rodrigo Tobar
0

The sample JSON file content is given below. It has records of two people, but it might as well have 2 million records.

    [
      {
        "Name" : "Joy",
        "Address" : "123 Main St",
        "Schools" : [
          "University of Chicago",
          "Purdue University"
        ],
        "Hobbies" : [
          {
            "Instrument" : "Guitar",
            "Level" : "Expert"
          },
          {
            "percussion" : "Drum",
            "Level" : "Professional"
          }
        ],
        "Status" : "Student",
        "id" : 111,
        "AltID" : "J111"
      },
      {
        "Name" : "Mary",
        "Address" : "452 Jubal St",
        "Schools" : [
          "University of Pensylvania",
          "Washington University"
        ],
        "Hobbies" : [
          {
            "Instrument" : "Violin",
            "Level" : "Expert"
          },
          {
            "percussion" : "Piano",
            "Level" : "Professional"
          }
        ],
        "Status" : "Employed",
        "id" : 112,
        "AltID" : "M112"
      }
    ]

The code below reads the file line by line and rebuilds each person's record as a JSON object. It is not yet a generator, but changing a couple of lines (wrapping it in a function and yielding each record instead of printing it) would make it one.

import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    #Reading file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()

        # when the right curly for every left curly is found,
        # it would mean that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            # Drop trailing whitespace and the comma that separates records
            jstr = jstr.rstrip().rstrip(',')
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        # Debugging limit: stop after 100 lines; remove it to process the whole file
        if lnum > 100:
            break
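For illustration, here is one way the generator version might look. This is a sketch and not the author's code: it tracks brace depth character by character, and it assumes that no '{' or '}' characters appear inside string values. The function name iter_records and the in-memory sample are hypothetical.

```python
import io
import json


def iter_records(fp):
    """Yield each top-level object of a JSON array, one at a time.

    Assumes no '{' or '}' characters occur inside string values.
    """
    depth = 0  # current nesting depth of curly braces
    buf = []   # characters of the record being collected
    for line in fp:
        for ch in line:
            if ch == "{":
                depth += 1
            if depth > 0:
                buf.append(ch)
                if ch == "}":
                    depth -= 1
                    if depth == 0:
                        # The brace that opened this record is now closed.
                        yield json.loads("".join(buf))
                        buf = []


# In-memory stand-in for a file opened in text mode.
sample = io.StringIO(
    '[\n'
    '  {"Name": "Joy", "Hobbies": [{"Level": "Expert"}]},\n'
    '  {"Name": "Mary", "id": 112}\n'
    ']\n'
)
records = list(iter_records(sample))
print([r["Name"] for r in records])
```

Only one record is held in memory at a time, so the same loop can be pointed at a real file object for much larger inputs.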

CypherX
Phantom
0
If you are working with JSON in the following format, you can use ijson.items():

sample json:

[
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"},
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"}
]

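import gzip
from pathlib import Path

import ijson

input_path = 'file.txt'
res = []

# Open gzip-compressed input transparently; ijson expects a binary file object
if Path(input_path).suffix[1:].lower() == 'gz':
    input_file_handle = gzip.open(input_path, mode='rb')
else:
    input_file_handle = open(input_path, 'rb')

# The 'item' prefix selects each element of the top-level array
with input_file_handle:
    for json_row in ijson.items(input_file_handle, 'item'):
        res.append(json_row)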