
My code is `data_review = pd.read_json('review.json')`, and the review data looks like this:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I got the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

My JSON file does not contain any comments and is 3.8 GB! I just downloaded the file from here to practice: link

When I use the following code, it throws the same error:

import json
with open('review.json') as json_file:
    data = json.load(json_file)
  • There is something wrong with your path / file-argument. Make sure the file exists in the folder you are running python from. Maybe add more details on how you call this script and from where. – sascha Oct 17 '17 at 12:43
  • You cannot have comments in a json file: https://stackoverflow.com/questions/244777/can-comments-be-used-in-json Can you try running the code with a clean .json file? – Lukas Ansteeg Oct 17 '17 at 12:53
  • @LukasAnsteeg I'm pretty sure it's never parsing the json due to some error before. – sascha Oct 17 '17 at 12:56
  • @sascha Yep, I have checked it carefully but it doesn't work. – ileadall42 Oct 17 '17 at 13:00
  • Well... we need more info! – sascha Oct 17 '17 at 13:03
  • Since the error occurs at line 335, which is not the one you posted above, could you maybe post the surrounding code snippet? – Lukas Ansteeg Oct 17 '17 at 13:04
  • @LukasAnsteeg This is probably the code of pandas' read_json. – sascha Oct 17 '17 at 13:06
  • @LukasAnsteeg Thanks a lot; my JSON file does not contain comments, and the error at line 335 is thrown inside the `read_json` code. – ileadall42 Oct 17 '17 at 15:33
  • Have you tried `data_review=pd.read_json(open('review.json'))` ? – scnerd Oct 17 '17 at 16:01
  • @scnerd Yes, I tried that but got the same error. Could there be a mistake inside the JSON file itself? I just downloaded the file from here to practice: [link](https://www.yelp.com/dataset/documentation/json) – ileadall42 Oct 17 '17 at 16:15
  • Have you tried updating pandas? Or using the `json` module to load the data, then create a dataframe directly from that? – scnerd Oct 17 '17 at 16:34
  • @scnerd Yeah, I have also tried `ijson`, but it throws an `Additional data` error. – ileadall42 Oct 17 '17 at 16:36
  • Again, json cannot have comments. – OneCricketeer Oct 18 '17 at 01:53
  • @cricket_007 That is a demo of the data; the JSON file does not contain any comments. – ileadall42 Oct 18 '17 at 01:54
  • Got it. Confused on what you copied then... Well, a database is a reasonable alternative to a file. https://stackoverflow.com/a/2402423/2308683 Another solution is distributed programming solutions like Dask or Spark - common solutions for dealing with data that doesn't fit entirely in memory. Yelp uses Hadoop internally – OneCricketeer Oct 18 '17 at 02:00

5 Answers


Perhaps the file you are reading contains multiple JSON objects rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Those methods are meant to read a file containing a single JSON document.

From the Yelp dataset I have seen, your file likely contains something like:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

Hence, it is important to realize that this is not a single JSON document; rather, it is multiple JSON objects in one file.
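A quick way to confirm this diagnosis (a sketch with a hypothetical throwaway file) is that json.load parses the first top-level object and then fails with "Extra data" as soon as it hits the second one:

```python
import json

# Hypothetical file containing two top-level JSON objects, one per line.
with open("two_objects.json", "w") as f:
    f.write('{"a": 1}\n{"a": 2}\n')

err = None
try:
    with open("two_objects.json") as f:
        json.load(f)  # parses the first object, then chokes on the second
except json.JSONDecodeError as exc:
    err = exc

print(err.msg)  # Extra data
```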

To read this data into a pandas DataFrame, the following solution should work:

import json
import pandas as pd

with open('review.json') as json_file:
    data = json_file.readlines()

# This step may take 8-10 minutes for 4-5 million rows: it converts
# each line (a JSON string) into an actual Python dict.
data = list(map(json.loads, data))

df = pd.DataFrame(data)

Since the data is quite large, your machine will take a considerable amount of time to load it into a DataFrame.

Shaurya Mittal
  • Any solution for a big JSON file that has one JSON object per line, without a for-loop in pandas? – devssh Jun 06 '18 at 09:57
  • @devssh, see the answer below! Just pass in `lines=True` and a `chunksize=` to pandas.read_json. You'll still need to loop over the JsonReader it returns to access the file contents, but you must take some approach like that to avoid loading the entire file into memory. Some details: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#line-delimited-json – Chris Mar 23 '19 at 01:31

If you don't want to use a for-loop, the following should do the trick:

import pandas as pd

df = pd.read_json("foo.json", lines=True)

This will handle the case where your json file looks similar to this:

{"foo": "bar"}
{"foo": "baz"}
{"foo": "qux"}

And will turn it into a DataFrame consisting of a single column, foo, with three rows.

You can read more in the pandas docs.
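As a quick, self-contained check of this behavior (the file name `foo.json` is just an example), writing those three lines and reading them back gives exactly that shape:

```python
import pandas as pd

# Write three newline-delimited JSON objects to an example file.
with open("foo.json", "w") as f:
    f.write('{"foo": "bar"}\n{"foo": "baz"}\n{"foo": "qux"}\n')

df = pd.read_json("foo.json", lines=True)  # one object per line
print(df.shape)            # (3, 1)
print(df["foo"].tolist())  # ['bar', 'baz', 'qux']
```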

Mant1c0r3
  • If downvoting, please provide an explanation for why this answer is not sufficient. – Mant1c0r3 Oct 10 '18 at 02:29
  • Not sure why you're being downvoted! If op's "json" file is actually a line-delimited list of json objects, then yours is a cleaner solution that takes full advantage of pandas. (people often confuse these two types of "json"... I think line-delimited json should always have a `.jsonl` extension) Yours is also better because if the `jsonl` file is very large, then you can set a `chunksize` so you get a `JsonReader` back instead of a `DataFrame`. This lets you avoid loading the entire jsonl file into memory. (though `lines=True` is a recent pandas feature...) – Chris Mar 23 '19 at 01:08
  • Note that the new line delimited json format seems to be known as "ndjson": http://ndjson.org/ – Josh Gallagher Aug 09 '19 at 12:46

Using the arguments lines=True and chunksize=X creates a reader that yields X lines at a time.

You then loop over the reader to process each chunk.

Here is a piece of code to help you understand:

import pandas as pd

chunks = pd.read_json('../input/data.json', lines=True, chunksize=10000)
for chunk in chunks:
    print(chunk)
    break

The reader yields a number of chunks determined by the length of your JSON file (counted in lines). For example, for a 100,000-line JSON file with chunksize = 10000, I get 10 chunks.

In the code above I added a break so that only the first chunk is printed, but if you remove it, you will get the 10 chunks one by one.
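The chunk arithmetic can be verified on a small example file (the file name and sizes here are illustrative):

```python
import pandas as pd

# Build a small line-delimited JSON file: 100 objects, one per line.
with open("data.json", "w") as f:
    for i in range(100):
        f.write('{"id": %d}\n' % i)

# chunksize=10 -> the reader yields 100 / 10 = 10 chunks.
reader = pd.read_json("data.json", lines=True, chunksize=10)
n_chunks = sum(1 for _ in reader)
print(n_chunks)  # 10
```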

Max

I'm adapting Max's answer to load a large JSON file into a DataFrame without running into memory errors. The following code should work:

import pandas as pd

chunks = pd.read_json('/content/gdrive/My Drive/yelp/yelp_academic_dataset_review.json', lines=True, chunksize=10000)

# Accumulate the chunks in a list and concatenate once at the end;
# calling pd.concat inside the loop would copy the growing frame on every pass.
pieces = []
for chunk in chunks:
    pieces.append(chunk)
reviews = pd.concat(pieces, ignore_index=True)

If your JSON file contains multiple objects instead of one, the following should work:

import json

data = []
with open('sample.json', 'r') as json_file:
    for line in json_file:
        data.append(json.loads(line))

Notice the difference between json.load and json.loads.

json.loads() expects a (valid) JSON string, i.e. {"foo": "bar"}, while json.load() reads from a file object. So, if your JSON file looks like what @Mant1c0r3 mentioned, then parsing it line by line with json.loads is appropriate.
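The difference can be seen in a tiny sketch, using StringIO to stand in for a file object:

```python
import json
from io import StringIO

s = '{"foo": "bar"}'

obj_from_string = json.loads(s)         # parses a JSON string
obj_from_file = json.load(StringIO(s))  # parses a file-like object

print(obj_from_string)  # {'foo': 'bar'}
```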

mOna