
I am working with a large, valid JSON file and trying to parse it with pandas. Reading it the normal way, data = pd.read_json(filename), works. But when I pass lines=True, as in data = pd.read_json(filename, lines=True), it throws ValueError: Expected object or value

I want to read this file in chunks, but I get the same error if I use the chunksize parameter.

Can someone point out what I am doing wrong here?

import pandas as pd

filename = 'data/tinyTwitter.json'
data = pd.read_json(filename, lines=True, chunksize=100)

Data

{
   "total_rows":3877777,
   "offset":805584,
   "rows":[
      {
         "id":"570379215192727552",
         "key":[
            "r1r01cdn8nb4",
            2015,
            2,
            25
         ],
         "value":{
            "type":"Feature",
            "geometry":{
               "type":"Point",
               "coordinates":[
                  144.92340088,
                  -37.95935781
               ]
            },
            "properties":{
               "created_at":"Wed Feb 25 00:26:16 +0000 2015",
               "text":"For the Oscars, Lady Gaga trained with a vocal coach DAILY for 6 months httmelbourne htto/ZSu8FifNUK",
               "location":"melbourne"
            }
         },
         "doc":{
            "_id":"570379215192727552",
            "_rev":"1-fa6a485cb4fe0575781b6c29286af554",
            "contributors":null,
            "truncated":false,
            "text":"For the Oscars, Lady Gaga trained with a vocal coach DAILY for 6 months htDIIS5EtsW9 #melbourne ho/ZSu8FifNUK",
            "in_reply_to_status_id":null,
            "favorite_count":0,
            "source":"",
            "retweeted":false,
            "coordinates":{
               "type":"Point",
               "coordinates":[
                  144.92340088,
                  -37.95935781
               ]
            },
            "entities":{
               "symbols":[

               ],
               "user_mentions":[

               ],
               "hashtags":[
                  {
                     "indices":[
                        95,
                        105
                     ],
                     "text":"melbourne"
                  }
               ],
               "urls":[
                  {
                     "url":"",
                     "indices":[
                        72,
                        94
                     ],
                     "expanded_url":"",
                     "display_url":"j.mp/1ag2Quk"
                  }
               ],
               "media":[
                  {
                     "expanded_url":"",
                     "display_url":"pir.FifNUK",
                     "url":"http/ZSu8FifNUK",
                     "media_url_https":"",
                     "id_str":"570379215142457344",
                     "sizes":{
                        "large":{
                           "h":380,
                           "resize":"fit",
                           "w":380
                        },
                        "small":{
                           "h":340,
                           "resize":"fit",
                           "w":340
                        },
                        "medium":{
                           "h":380,
                           "resize":"fit",
                           "w":380
                        },
                        "thumb":{
                           "h":150,
                           "resize":"crop",
                           "w":150
                        }
                     },
                     "indices":[
                        106,
                        128
                     ],
                     "type":"photo",
                     "id":570379215142457340,
                     "media_url":""
                  }
               ]
            },
            "in_reply_to_screen_name":null,
            "in_reply_to_user_id":null,
            "retweet_count":0,
            "id_str":"570379215192727552",
            "favorited":false,
            "user":{
               "follow_request_sent":false,
               "profile_use_background_image":true,
               "profile_text_color":"333333",
               "default_profile_image":false,
               "id":2543131938,
               "profile_background_image_url_https":"",
               "verified":false,
               "profile_location":null,
               "profile_image_url_https":"",
               "profile_sidebar_fill_color":"DDEEF6",
               "entities":{
                  "url":{
                     "urls":[
                        {
                           "url":"",
                           "indices":[
                              0,
                              22
                           ],
                           "expanded_url":"",
                           "display_url":"youthsnews.com.au"
                        }
                     ]
                  },
                  "description":{
                     "urls":[

                     ]
                  }
               },
               "followers_count":68313,
               "profile_sidebar_border_color":"C0DEED",
               "id_str":"2543131938",
               "profile_background_color":"C0DEED",
               "listed_count":6,
               "is_translation_enabled":false,
               "utc_offset":36000,
               "statuses_count":1390,
               "description":"media network",
               "friends_count":788,
               "location":"pacific, oceania",
               "profile_link_color":"042A38",
               "profile_image_url":"",
               "following":false,
               "geo_enabled":true,
               "profile_banner_url":"h8",
               "profile_background_image_url":"htng",
               "name":"ynnmedia™",
               "lang":"en",
               "profile_background_tile":false,
               "favourites_count":765,
               "screen_name":"ynnmedianetwork",
               "notifications":false,
               "url":"htxq",
               "created_at":"Tue Jun 03 09:27:23 +0000 2014",
               "contributors_enabled":false,
               "time_zone":"Yakutsk",
               "protected":false,
               "default_profile":false,
               "is_translator":false
            },
            "geo":{
               "type":"Point",
               "coordinates":[
                  -37.95935781,
                  144.92340088
               ]
            },
            "in_reply_to_user_id_str":null,
            "possibly_sensitive":false,
            "lang":"en",
            "created_at":"Wed Feb 25 00:26:16 +0000 2015",
            "in_reply_to_status_id_str":null,
            "place":null,
            "metadata":{
               "iso_language_code":"en",
               "result_type":"recent"
            },
            "location":"melbourne"
         }
      },
      {
         "id":"570379220146200576",
         "key":[
            "r1r01cdn8nb4",
            2015,
            2,
            25
         ],
         "value":{
            "type":"Feature",
            "geometry":{
               "type":"Point",
               "coordinates":[
                  144.92340088,
                  -37.95935781
               ]
            },
            "properties":{
               "created_at":"Wed Feb 25 00:26:17 +0000 2015",
               "text":"Abuses in AIB Roast were dubbed: Rakhi Sawant Ka",
               "location":"melbourne"
            }
         },
         "doc":{
            "_id":"570379220146200576",
            "_rev":"1-61252163c64f6f548cab2b8eb4cbd045",
            "contributors":null,
            "truncated":false,
            "text":"Abuses in AIB Roast were dubbed: Rakhi Sawant ourne htco/MbglBYEAKa",
            "in_reply_to_status_id":null,
            "favorite_count":0,
            "source":"t</a>",
            "retweeted":false,
            "coordinates":{
               "type":"Point",
               "coordinates":[
                  144.92340088,
                  -37.95935781
               ]
            },
            "entities":{
               "symbols":[

               ],
               "user_mentions":[

               ],
               "hashtags":[
                  {
                     "indices":[
                        69,
                        79
                     ],
                     "text":"melbourne"
                  }
               ],
               "urls":[
                  {
                     "url":"htKiAELeMO6",
                     "indices":[
                        46,
                        68
                     ],
                     "expanded_url":"/1ag2Omb",
                     "display_url":"j.mp/1ag2Omb"
                  }
               ],
               "media":[
                  {
                     "expanded_url":"h79220146200576/photo/1",
                     "display_url":"pglBYEAKa",
                     "url":"rr",
                     "media_url":"pk4O5UIAAI0l",
                     "id_str":"570379220049731584",
                     "sizes":{
                        "large":{
                           "h":380,
                           "resize":"fit",
                           "w":380
                        },
                        "small":{
                           "h":340,
                           "resize":"fit",
                           "w":340
                        },
                        "medium":{
                           "h":380,
                           "resize":"fit",
                           "w":380
                        },
                        "thumb":{
                           "h":150,
                           "resize":"crop",
                           "w":150
                        }
                     },
                     "indices":[
                        80,
                        102
                     ],
                     "type":"photo",
                     "id":570379220049731600,
                     "media_urrl":"htpk4O5UIAAI0l1.jpg"
                  }
               ]
            },
            "in_reply_to_screen_name":null,
            "in_reply_to_user_id":null,
            "retweet_count":0,
            "id_str":"570379220146200576",
            "favorited":false,
            "user":{
               "follow_request_sent":false,
               "profile_use_background_image":true,
               "profile_text_color":"333333",
               "default_profile_image":false,
               "id":2543131938,
               "profile_background_image_url_https":"h/images/themes/theme1/bg.png",
               "verified":false,
               "profile_location":null,
               "profile_image_url_https":"htt/567602629937606657/ZCcCDFzr_normal.jpeg",
               "profile_sidebar_fill_color":"DDEEF6",
               "entities":{
                  "url":{
                     "urls":[
                        {
                           "url":"htAxq",
                           "indices":[
                              0,
                              22
                           ],
                           "expanded_url":"hws.com.au",
                           "display_url":"youth.au"
                        }
                     ]
                  },
                  "description":{
                     "urls":[

                     ]
                  }
               },
               "followers_count":68313,
               "profile_sidebar_border_color":"C0DEED",
               "id_str":"2543131938",
               "profile_background_color":"C0DEED",
               "listed_count":6,
               "is_translation_enabled":false,
               "utc_offset":36000,
               "statuses_count":1390,
               "description":"media network",
               "friends_count":788,
               "location":"pacific, oceania",
               "profile_link_color":"042A38",
               "profile_image_url":"htes/567602629937606657/ZCcCDFzr_normal.jpeg",
               "following":false,
               "geo_enabled":true,
               "profile_banner_url":"httpanners/2543131938/1424079798",
               "profile_background_image_url":"http/themes/theme1/bg.png",
               "name":"ynnmedia™",
               "lang":"en",
               "profile_background_tile":false,
               "favourites_count":765,
               "screen_name":"ynnmedianetwork",
               "notifications":false,
               "url":"httgeAxq",
               "created_at":"Tue Jun 03 09:27:23 +0000 2014",
               "contributors_enabled":false,
               "time_zone":"Yakutsk",
               "protected":false,
               "default_profile":false,
               "is_translator":false
            },
            "geo":{
               "type":"Point",
               "coordinates":[
                  -37.95935781,
                  144.92340088
               ]
            },
            "in_reply_to_user_id_str":null,
            "possibly_sensitive":false,
            "lang":"en",
            "created_at":"Wed Feb 25 00:26:17 +0000 2015",
            "in_reply_to_status_id_str":null,
            "place":null,
            "metadata":{
               "iso_language_code":"en",
               "result_type":"recent"
            },
            "location":"melbourne"
         }
      }
   ]
}
pnv
Waqar ul islam
  • Are the data confidential? If not, could you share the first 200 rows? – jezrael Apr 05 '19 at 06:48
  • Okay, let me share it. – Waqar ul islam Apr 05 '19 at 07:07
  • In your input file, do you have one valid JSON record per line? – Sina Apr 05 '19 at 07:23
  • @jezrael I have attached the sample data. I have data like this with thousands of rows. I pasted two sample records. – Waqar ul islam Apr 05 '19 at 07:25
  • pandas.read_json only accepts JSON input in prespecified formats. See the valid formats in the documentation (look at the examples with different orient arguments). According to the documentation, if you select lines=True, pandas.read_json expects one valid JSON object per line. You get the error because your input does not adhere to this format. – Sina Apr 05 '19 at 07:36
  • @Sina Is there a way to change my JSON format so that I can use lines=True properly? – Waqar ul islam Apr 05 '19 at 07:45
  • @Waqarulislam Do you have any control over the source of the data (any chance that you can change the format at the source)? How big is your data? Can you fit it in memory? Is this a one-off task, or should it be automated? – Sina Apr 05 '19 at 07:56
  • @Sina I have a 10 GB file of this data. I want to read this file using MPI in such a way that each process reads its own part and processes it. It's a one-off task. – Waqar ul islam Apr 05 '19 at 07:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/191300/discussion-between-sina-and-waqar-ul-islam). – Sina Apr 05 '19 at 08:11
  • I remember an issue like this from a little while back (Twitter JSON formats); read [here](https://stackoverflow.com/questions/55182417/reading-json-data-into-dataframe/55182631#55182631). It's possibly the same issue, where you are getting multiple JSON structures and will need to split them at the `][` within the file. – chitown88 Apr 05 '19 at 08:41
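
To illustrate the point made in the comments about lines=True: that mode expects one complete JSON object per line, whereas the sample above is a single document with the records nested under "rows". A minimal sketch of a one-off conversion follows, using only the standard library. The function name rows_to_json_lines and the rows_key parameter are hypothetical, and the sketch assumes the whole document fits in memory, which may not hold for a 10 GB file.

```python
import json


def rows_to_json_lines(src_path, dst_path, rows_key="rows"):
    """Rewrite a single JSON document as JSON Lines, one record per line.

    Hypothetical helper: assumes the records live in a list under
    `rows_key` and that the whole document fits in memory.
    """
    with open(src_path, encoding="utf-8") as src:
        document = json.load(src)

    rows = document[rows_key]
    with open(dst_path, "w", encoding="utf-8") as dst:
        for row in rows:
            # Without `indent`, json.dumps emits no raw newlines
            # (newlines inside strings are escaped), so each record
            # stays on exactly one line.
            dst.write(json.dumps(row) + "\n")
    return len(rows)
```

Once a file is in this shape, pd.read_json(dst_path, lines=True, chunksize=100) should return an iterator of DataFrames rather than raising the ValueError.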

1 Answer


I added the link above in the comments. I believe the issue is that the Twitter response puts multiple JSON structures into one file without separating them. The solution that worked for me was to read the whole file, split it into a list, and then work with each piece individually.

import json

filename = 'data/tinyTwitter.json'

data = []
with open(filename) as json_file:
    data_str = json_file.read()

# Drop everything up to the first '[' and from the last ']' onward,
# then split on '][' where one array ends and the next begins.
data_str = data_str.split('[', 1)[-1]
data_str = data_str.rsplit(']', 1)[0]
chunks = data_str.split('][')

for json_str in chunks:
    # Restore the brackets so each piece is a valid JSON array again.
    temp_data = json.loads('[' + json_str + ']')
    data.extend(temp_data)
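
One caveat with splitting on the literal `][` is that it would also split inside a string value that happens to contain those two characters. As a possible alternative (a sketch, not what was used above), the standard library's json.JSONDecoder.raw_decode can parse back-to-back JSON documents one at a time. The function name iter_concatenated_json is hypothetical, and the sketch assumes the file content has already been read into a single string.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of concatenated documents.

    Hypothetical helper: assumes `text` holds zero or more JSON values
    back to back, optionally separated by whitespace.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

Each yielded value is one complete document, so a list of arrays can be flattened afterwards with data.extend(value) in the same way as in the loop above.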
chitown88