I am using the Yelp dataset and I want to parse the review JSON file into a dictionary. I tried loading it into a pandas DataFrame and then creating the dictionary, but because the file is so big this is very time-consuming. I want to keep only the user_id and stars values. A line of the JSON file looks like this:

{
    "votes": {"funny": 0, "useful": 2, "cool": 1},
    "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
    "review_id": "15SdjuK7DmYqUAj6rjGowg",
    "stars": 5,
    "date": "2007-05-17",
    "text": "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
    "type": "review",
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

How can I iterate over every 'field' (for lack of a better word)? So far I can only iterate over each line.

EDIT

As requested, here is the pandas code.

Reading the JSON:

import json
import pandas as pd

with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)

Creating the dictionary

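# Map (business_id, user_id) -> star rating.
# Named "ratings" so the built-in dict type is not shadowed.
ratings = {}

for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    ratings[key] = rating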
Mike Müller
mnmbs
  • Possible duplicate of [Iteratively parse JSON file](http://stackoverflow.com/questions/20885797/iteratively-parse-json-file) – Chris Martin Dec 03 '15 at 23:12
  • Is there any other way to do it using only pandas? – mnmbs Dec 03 '15 at 23:19
  • Show your pandas code for reading the json and the conversion into a dictionary. – Mike Müller Dec 03 '15 at 23:46
  • Adding this as a general comment because it's not specific to my answer, you might consider whether or not it's time for a database. With big data sets there's a point where storing things in memory, or flat files, or json files is no longer practical and it's time to use a database. Not sure if you're at that point, but it's something to keep in mind. Python has [sqlite3](https://docs.python.org/3.4/library/sqlite3.html)—which you can also use with [sqlalchemy](http://www.sqlalchemy.org/)—for "easy" database needs. – Michelle Welcks Dec 04 '15 at 00:15
  • I want to stick to pandas, and I think that dictionaries are the fastest data structures available for the operations I want to do. After that I want to find some specific users, for example the users who have reviewed more than 50 unique places. – mnmbs Dec 04 '15 at 00:19
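
The goal mentioned in the last comment (users who reviewed more than 50 unique places) could be sketched roughly as follows. This is only an illustration: it assumes a dictionary keyed by (business_id, user_id) like the one built in the question, and the function name `users_with_many_places` and the sample data are made up.

```python
from collections import defaultdict


def users_with_many_places(ratings, min_places=50):
    """Return the user_ids that reviewed more than min_places distinct businesses.

    ratings is assumed to map (business_id, user_id) -> stars.
    """
    places_by_user = defaultdict(set)
    for business_id, user_id in ratings:
        places_by_user[user_id].add(business_id)
    return {
        user_id
        for user_id, places in places_by_user.items()
        if len(places) > min_places
    }


# Tiny made-up example, using a threshold of 1 instead of 50.
ratings = {("b1", "u1"): 5, ("b2", "u1"): 4, ("b1", "u2"): 3}
frequent = users_with_many_places(ratings, min_places=1)
print(frequent)
```

Collecting business ids into a set per user means a place is counted once for each user, whatever else the dictionary holds.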

1 Answer

You don't need to read this into a DataFrame. When a file holds a single JSON object, json.load() returns it as a dictionary. For example:

sample.json

{
"votes": {
"funny": 0,
"useful": 2,
"cool": 1
},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

read_json.py

import json

with open('sample.json', 'r') as fh:
    result_dict = json.load(fh)

print(result_dict['user_id'])
print(result_dict['stars'])

output

Xqd0DzHaiyRqVH3WRG7hzg
5

With that output you can easily create a DataFrame.
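
Since the parsed result is an ordinary dictionary, every field can be visited with `.items()`. Here is a minimal sketch of that, using a shortened version of the sample record as an inline string (the shortening and the variable names are my own, not part of the dataset):

```python
import json

# Shortened, illustrative version of the sample record.
line = (
    '{"votes": {"funny": 0, "useful": 2, "cool": 1}, '
    '"user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "stars": 5}'
)

record = json.loads(line)

# Visit every top-level field; nested objects such as "votes" arrive as dicts.
for field, value in record.items():
    print(field, value)

# Keep only the fields of interest.
wanted = {key: record[key] for key in ("user_id", "stars")}
print(wanted)
```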

There are several good discussions on SO about parsing JSON as a stream, but the gist is that the standard json module does not support it natively, although some third-party tools attempt it.

In the interest of keeping your code simple and with minimal dependencies, you might see if reading the JSON directly into a dictionary is a sufficient improvement.
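
If the file holds one JSON object per line, as the question describes, one possible approach is to parse each line with `json.loads()` and fill the dictionary directly, skipping the DataFrame. The sketch below assumes that layout; the helper name `load_ratings` and the two sample records are invented for illustration.

```python
import io
import json


def load_ratings(lines):
    """Build {(business_id, user_id): stars} from an iterable of JSON lines."""
    ratings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # assumes one complete JSON object per line
        ratings[(record["business_id"], record["user_id"])] = record["stars"]
    return ratings


# Two made-up records standing in for the real file.
sample = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 5, "text": "great"}\n'
    '{"user_id": "u2", "business_id": "b1", "stars": 3, "text": "ok"}\n'
)
ratings = load_ratings(sample)
print(ratings)
```

For the real data, the open file object can be passed in place of `sample`, since a text file iterates line by line; opening it with an explicit `encoding='utf-8'` is an assumption about the file that may help avoid decoding errors.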

Michelle Welcks
  • I already tried this but I get an error, probably because the JSON has some nested values. Error: `ValueError: Extra data: line 2 column 1 - line 1569265 column 1 (char 763 - 1426365176)` – mnmbs Dec 03 '15 at 23:54
  • @mnmbs Take a look at this Stackoverflow answer; it might help: [Python json.loads shows ValueError: Extra data](http://stackoverflow.com/a/21058946/3182836) – Michelle Welcks Dec 03 '15 at 23:59
  • The file must be read in binary mode (`with open('sample.json', 'rb') as fh`). Simple 'r' may produce **UnicodeDecodeError**. (It happened to me a lot of times.) – Apostolos Jul 06 '20 at 06:38