3

I have a .json file where each line is a separate JSON object. For example, the first two lines are:

{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}

{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}

I tried processing it with the ijson library as follows:

import ijson

with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)

However, I get this error:

JSONError: Additional data

It seems I'm receiving this error because the file contains multiple objects.

What's the recommended way to analyze such a JSON file in Jupyter?

Thank you in advance.

xdze2
rohan

3 Answers

3

If this is the complete file, it is not a single valid JSON document. To be one, the objects must be separated by commas and the whole file must start and end with a square bracket, like so: [{...},{...}]. For your data it would look like:

[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]

Here is some code to clean your file:

# Convert a file with one JSON object per line into a single JSON array.
# The input is streamed line by line, so it is never fully loaded into memory.
with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    g.write("[")
    first = True
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not first:
            g.write(",\n")  # separator goes before every object except the first
        g.write(line)
        first = False
    g.write("]")

Once the file is a valid JSON array, you could also read it with the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).

import pandas as pd

#get a pandas dataframe object from json file
df = pd.read_json("path/to/your/filename.json")
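
The call above expects a single JSON document. pandas can also read a file with one object per line directly, and can do so in chunks when the file is too large for memory. The following is a minimal sketch under stated assumptions: the sample rows, the `stars` field, and the temporary file are hypothetical stand-ins for the real data, and the chunk size of 4 is arbitrary.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the real reviews file.
rows = [{"review_id": f"r{i}", "stars": i % 5 + 1} for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "reviews.json")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# lines=True tells pandas that every line is its own JSON object.
df = pd.read_json(path, lines=True)

# chunksize (only valid together with lines=True) yields DataFrames of at
# most `chunksize` rows, so the whole file is never held in memory at once.
total_rows = 0
for chunk in pd.read_json(path, lines=True, chunksize=4):
    total_rows += len(chunk)
```

With chunking, each `chunk` can be filtered or aggregated before the next one is read, which is usually the point of the exercise when the full DataFrame does not fit in memory.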

If you are not familiar with pandas, here is a quick head start on working with a DataFrame object:

df.head()  # the first rows of the DataFrame
df["review_id"]  # the column review_id as a Series
df.iloc[1, :]  # the complete row at position 1
df.iloc[1, 2]  # the item at row position 1 and column position 2
WurzelseppQX
  • The issue is that I have quite a large JSON file, and pandas read_json causes a memory error. Hence, I am attempting to follow the instructions from https://www.dataquest.io/blog/python-json-tutorial/. Then again, the file format is incorrect. I should find a way to wrap the objects in square brackets, separated by commas. – rohan Aug 08 '18 at 19:33
  • I just added some code that you could use to create a new JSON file in the correct format. However, if your file really is too big for pandas, I fear that it will take a while. – WurzelseppQX Aug 08 '18 at 20:15
  • And it is much faster than I expected. – rohan Aug 09 '18 at 04:04
  • Was this your accepted answer, or did you still have some issues? – WurzelseppQX Mar 01 '19 at 07:20
2

While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over the lines and parse each one into an object.

You can aggregate these objects in one list, and from there do whatever you like with your data:

import json

with open(filename, 'r') as f:
    object_list = []
    for line in f:  # iterate over the file directly instead of calling readlines()
        if line.strip():  # skip blank lines
            object_list.append(json.loads(line))
# object_list now contains all of your file's data

You could write it as a list comprehension to make it a little more Pythonic:

with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f if line.strip()]
# object_list now contains all of your file's data
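
Both versions above keep every parsed object in memory. If the file is large, a generator that yields one object at a time may be a better fit. This is only a sketch: the helper name `iter_json_lines`, the `stars` field, and the temporary sample file are hypothetical and not part of the original question.

```python
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a newline-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line is broken instead of failing anonymously.
                raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


# Hypothetical sample file with a blank line between the two records.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    f.write('{"review_id": "a", "stars": 5}\n\n{"review_id": "b", "stars": 3}\n')

# Aggregate over the file without building a list of all objects.
star_total = sum(obj["stars"] for obj in iter_json_lines(path))
```

Because the generator is lazy, only one line is parsed at a time; wrap it in `list(...)` if you do want everything in memory.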
A-y
  • I am not sure it's fair to say it's an invalid JSON file. For example, in my use case I have a network socket instead of a file. Now what, would you say I have an invalid JSON socket? :) [This answer](https://stackoverflow.com/a/43807246/2388257#parse-multiple-json-objects-that-are-in-one-line) works for me though. – Hi-Angel Jun 01 '20 at 13:24
  • JSON only allows one object at the root of the document. You have two. That's objectively invalid JSON. That being said, newline-delimited JSON is pretty common. You simply need to read each line as a separate JSON object, that is all. Treating the whole thing as a single object will fail because you have more than one. – A-y Jun 02 '20 at 18:45
  • The answer you have linked is a more complex solution that is warranted when there is no delimiter between objects. If you have the luxury of having newlines between your objects, you should leverage them. – A-y Jun 02 '20 at 18:50
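
For the case mentioned in these comments, where objects follow each other without a newline delimiter, the standard library's `json.JSONDecoder.raw_decode` can parse one object and report where it ended. The following is a rough sketch, not the linked answer's code: the helper name `iter_concatenated_json` and the sample string are hypothetical, and it assumes the whole input fits in a single string.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical input: three objects with no reliable delimiter between them.
sample = '{"review_id": "a"}{"review_id": "b"} {"review_id": "c"}'
objects = list(iter_concatenated_json(sample))
```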
1

Your file contains one JSON object per line rather than a single JSON document, which is why the parser throws an error. Parse each line separately:

import json

with open(filename, 'r') as f:
    lines = f.readlines()
    first = json.loads(lines[0])
    second = json.loads(lines[1])

That should read both lines and parse them properly.
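
Note that `readlines()` reads the entire file even when only the first two lines are needed. To peek at the first few records of a large file, `itertools.islice` stops after the requested number of lines. This is a minimal sketch; the helper name `head_json_lines` and the temporary sample file are hypothetical.

```python
import json
import os
import tempfile
from itertools import islice


def head_json_lines(path, n=2):
    """Parse only the first n non-empty lines of a newline-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        # islice stops reading after n lines, so the rest of the file is untouched.
        return [json.loads(line) for line in islice(non_empty, n)]


# Hypothetical sample file with three records.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    for review_id in ("a", "b", "c"):
        f.write(json.dumps({"review_id": review_id}) + "\n")

first, second = head_json_lines(path, n=2)
```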

C.Nivs