Fixing Large JSON dataset with Unreliable Line Breaks in Python

Question

I've streamed a lot of Twitter data from the free 1% stream into an S3 bucket and just downloaded it for some analyses.

I noticed that some of my JSON data is reported as invalid when I check it on www.jsonlint.com. After some digging, I found that this is because some tweets aren't separated by a newline character.

Can someone point me in the right direction towards fixing this? My approach would be to go through each file and check whether there's a newline between tweets (I think Twitter uses \r\n, right?), and add one wherever it's missing.

Also, is there a reason this happens? Is it an issue with the code in my streamer (a Node.js script that collects everything)?

Here's a sample dataset (just a few representative tweets that cause the same issue): http://pastebin.com/8AjM6yc2

Some sample data that doesn't get added to the list:

{"created_at":"Sun Sep 18 23:58:50 +0000 2016","id":777658170751582200,"id_str":"777658170751582208","text":"nobody tell him i like the packers okay? @ University of Phoenix Stadium www.example.com","source":"<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":35321729,"id_str":"35321729","name":"sammy","screen_name":"guzzzyy","location":"university of arizona '19","url":null,"description":"loves otters and conspiracy theories | ΣK","protected":false,"verified":false,"followers_count":620,"friends_count":306,"listed_count":4,"favourites_count":15824,"statuses_count":20940,"created_at":"Sat Apr 25 21:58:01 +0000 2009","utc_offset":-25200,"time_zone":"Arizona","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/447840344499974144/A8FRdFXz.png","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/447840344499974144/A8FRdFXz.png","profile_background_tile":true,"profile_link_color":"DCBBFA","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"ED0043","profile_text_color":"FFFFFF","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/768288586886029315/h5-HBL5y_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/768288586886029315/h5-HBL5y_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/35321729/1473271290","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.52812869,-112.26250073]},"coordinates":{"type":"Point","coordinates":[-112.26250073,33.52812869]},"place":{"id":"a612c69b44b2e5da","url":"http
s://api.twitter.com/1.1/geo/id/a612c69b44b2e5da.json","place_type":"admin","name":"Arizona","full_name":"Arizona, USA","country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-114.818269,31.332246],[-114.818269,37.004261],[-109.045153,37.004261],[-109.045153,31.332246]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"www.example.com","expanded_url":"https://www.instagram.com/p/BKhD2tfhOCK/","display_url":"instagram.com/p/BKhD2tfhOCK/","indices":[73,96]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1474243130754"},
{"created_at":"Sun Sep 18 23:58:50 +0000 2016","id":777658171657691100,"id_str":"777658171657691136","text":"Pastor @pastormurph &amp; @zebonperiscope\n@ChangingAGenAtl @ Changing A Generation Full Gospel… www.example.com","source":"<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":157062565,"id_str":"157062565","name":"Lady Tyisha Phillips","screen_name":"TyishaPhillips","location":"Atlanta, GA","url":"http://www.cagmin.org","description":"Young Adult Ministry Co-Pastor","protected":false,"verified":false,"followers_count":643,"friends_count":609,"listed_count":4,"favourites_count":163,"statuses_count":1752,"created_at":"Fri Jun 18 19:02:11 +0000 2010","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"BDB9BD","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000149695120/SnBdvtk3.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000149695120/SnBdvtk3.jpeg","profile_background_tile":true,"profile_link_color":"990000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/666443040052142080/6huKB94N_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/666443040052142080/6huKB94N_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/157062565/1386911172","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.68266,-84.49847]},"coordinates":{"type":"Point","coordinates":[-84.49847,33.68266]},"place":{"id":"8173485c
72e78ca5","url":"https://api.twitter.com/1.1/geo/id/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.576827,33.647503],[-84.576827,33.886886],[-84.289385,33.886886],[-84.289385,33.647503]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"www.example.com","expanded_url":"https://www.instagram.com/p/BKhD1gHAwKL/","display_url":"instagram.com/p/BKhD1gHAwKL/","indices":[96,119]}],"user_mentions":[{"screen_name":"pastormurph","name":"William Murphy","id":21972711,"id_str":"21972711","indices":[7,19]},{"screen_name":"ZEBonPeriscope","name":"Zebulon Ellis","id":3327209906,"id_str":"3327209906","indices":[26,41]},{"screen_name":"CHANGINGAGENATL","name":"CAGFGBC ATL","id":234880261,"id_str":"234880261","indices":[42,58]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1474243130970"},
{"created_at":"Sun Sep 18 23:58:51 +0000 2016","id":777658172081176600,"id_str":"777658172081176576","text":" @ Villas de la Boca www.example.com","source":"<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":879232754,"id_str":"879232754","name":"Tadeo Martinez","screen_name":"LuizTigres","location":"Instagram","url":"http://Instagram.com/luiztadeo","description":"Hincha de Tigres ⚽️ en las buenas te quiero , en las malas Teamo! Snap:tadeo.mtz","protected":false,"verified":false,"followers_count":148,"friends_count":134,"listed_count":1,"favourites_count":686,"statuses_count":1987,"created_at":"Sun Oct 14 03:31:55 +0000 2012","utc_offset":-14400,"time_zone":"Eastern Time (US & Canada)","geo_enabled":true,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"131516","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000117839880/ae0d0e92e0b9ff5c5b9184636d7a8220.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000117839880/ae0d0e92e0b9ff5c5b9184636d7a8220.jpeg","profile_background_tile":true,"profile_link_color":"009999","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/757612542084599808/DElVfT1O_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/757612542084599808/DElVfT1O_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/879232754/1464846885","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[25.4462465,-100.09724682]},"coordinates":{"type":"Point","coordin
ates":[-100.09724682,25.4462465]},"place":{"id":"2c05f2ee0a17497d","url":"https://api.twitter.com/1.1/geo/id/2c05f2ee0a17497d.json","place_type":"city","name":"Santiago","full_name":"Santiago, Nuevo León","country_code":"MX","country":"México","bounding_box":{"type":"Polygon","coordinates":[[[-100.530034,25.228247],[-100.530034,25.521547],[-100.028913,25.521547],[-100.028913,25.228247]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"www.example.com","expanded_url":"https://www.instagram.com/p/BKhD2xWhwdMPu8FAx3zoZXoqrgTd-ZrksAR76E0/","display_url":"instagram.com/p/BKhD2xWhwdMP…","indices":[22,45]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"es","timestamp_ms":"1474243131071"},
{"created_at":"Sun Sep 18 23:58:51 +0000 2016","id":777658172458827800,"id_str":"777658172458827776","text":"There's nothing better than finding someone to serve Jesus with ❤️ |… www.example.com","source":"<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":204172559,"id_str":"204172559","name":"Amanda Gutierrez ❁","screen_name":"agutierrez910","location":null,"url":"http://newcreationca.org","description":"Worship Pastor at New Creation Church, Wife to @willspeaks, mommy of 3 boys, & child of an AMAZING God! #winning","protected":false,"verified":false,"followers_count":519,"friends_count":272,"listed_count":10,"favourites_count":1969,"statuses_count":3823,"created_at":"Mon Oct 18 02:45:16 +0000 2010","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"ACDED6","profile_background_image_url":"http://abs.twimg.com/images/themes/theme18/bg.gif","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme18/bg.gif","profile_background_tile":false,"profile_link_color":"038543","profile_sidebar_border_color":"EEEEEE","profile_sidebar_fill_color":"F6F6F6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/771235854714929152/PD_xu4Od_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/771235854714929152/PD_xu4Od_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/204172559/1472711595","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.92489,-116.88952]},"coordinates":{"type":"Point","coordinates":[-116.88952,33.92489]},"place
":{"id":"792551bc9bd3c992","url":"https://api.twitter.com/1.1/geo/id/792551bc9bd3c992.json","place_type":"city","name":"Banning","full_name":"Banning, CA","country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-116.947005,33.902607],[-116.947005,33.94771],[-116.859016,33.94771],[-116.859016,33.902607]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"www.example.com","expanded_url":"https://www.instagram.com/p/BKhD22EAPT04KMmkxbg5yAGTNFsGdty870jAM80/","display_url":"instagram.com/p/BKhD22EAPT04…","indices":[70,93]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1474243131161"},

And the code I used:

from json import JSONDecoder 
from functools import partial 
import os
import json
import io  
import csv  

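def json_parse(file, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    # Read fixed-size chunks until read() returns '' at end of file.
    for chunk in iter(partial(file.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                # raw_decode returns the parsed object and the index where it ended.
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:]
            except ValueError:
                # Buffer does not start with a complete object: read another chunk.
                break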


dataset = [] 

for file in os.listdir():
    if file.startswith("twitter"):
        with open(file, 'r', encoding='utf-8') as infh:
            for data in json_parse(infh):
                dataset.append(data)

print(len(dataset))
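
A small reproduction may help show why this loop can stall after the first tweet of each file. It repeats the `json_parse` loop from above and feeds it a made-up three-object sample (the `sample` string is an assumption, not real data) in which every object is followed by a comma and a newline, as in the tweets pasted above. `raw_decode` does not skip a leading comma or newline, so every attempt after the first object raises `ValueError` and the loop just keeps reading until the file ends:

```python
import io
from functools import partial
from json import JSONDecoder


def json_parse(file, decoder=JSONDecoder(), buffersize=2048):
    """Same loop as in the question, repeated so this snippet runs on its own."""
    buffer = ''
    for chunk in iter(partial(file.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:]
            except ValueError:
                break


# Hypothetical miniature of one file: objects separated by ",\n".
sample = '{"id": 1},\n{"id": 2},\n{"id": 3},\n'
parsed = list(json_parse(io.StringIO(sample)))
print(parsed)  # only the first object is yielded
```

If the real files look like this, one entry per file is what the loop would produce, which matches the file count worked out in the comments below.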

Sorry, we are not going to download a half-megabyte sample file. I'm sure you can reduce this to a much smaller sample set that can be inlined in the question. However, this is a duplicate of one of two questions depending on where newlines fall; between each JSON object, or also inside JSON objects. Your description sounds like the latter, so I duped you to that second question. — Martijn Pieters, Oct 01 '16 at 13:51
Hi, I actually saw your comment on reading the data if there are no newlines using the buffer method. But when I implement that, I actually get a lot of tweets that "fail" (probably due to ValueError). I imagine this is because some tweets have the newline at the end and some don't. How would I go about resolving it in that case? — shishy, Oct 01 '16 at 14:37
That doesn't make much sense; the post I duped you to doesn't rely on newlines at all. You'll have to provide a proper [mcve] if that post doesn't solve your situation. — Martijn Pieters, Oct 01 '16 at 14:40
I updated with the code I used and a few tweets I had. I put the tweets in pastebin, is that okay? — shishy, Oct 01 '16 at 14:54
Can you reduce that to a small sample that shows the issue? Then put that in your post, and include the full traceback of the error. — Martijn Pieters, Oct 01 '16 at 14:55
External URLs have a different lifetime from a Stack Overflow post. We want questions and their answers to last here, so future visitors can see if they have the same problem and can apply the solution to their situation. External URLs don't fit that goal. — Martijn Pieters, Oct 01 '16 at 14:56
That's fair. I'm trying to copy paste it in here but the problem is that some of the tweets have URLs that were shortened and it won't let me submit a post with shortened URLs :/ — shishy, Oct 01 '16 at 15:02
Otherwise, replace those with `http://example.com` URLs. The exact URL doesn't matter to the problem, after all. — Martijn Pieters, Oct 01 '16 at 15:03
Okay, I've added some sample data in. Four tweets. These all go through the ValueError exception so they are "fail" in that they don't get appended to my data. Of the 2 million tweets I have, only ~2500 are getting added so the vast majority are failing for some reason... — shishy, Oct 01 '16 at 15:10
Also thank you for all your help by the way, I really appreciate it! — shishy, Oct 01 '16 at 15:10
We do need the full error message too; what is the traceback? A `ValueError` is a class of exceptions, not a specific error. — Martijn Pieters, Oct 01 '16 at 15:11
Or are you referring to the `ValueError` catching in my method? That's there to find the next object boundary when the buffer is too small. This doesn't skip objects. — Martijn Pieters, Oct 01 '16 at 15:12
Yeah, I was referring to that in your method. Wow, I completely misunderstood that. My bad. Basically, I'm at a loss for why even though it's going through all the files, not all of my tweets are getting added to the list. This is really strange... — shishy, Oct 01 '16 at 15:17
So I need to go through each file and add a \ before those quotes... that makes a hell of a lot more sense — shishy, Oct 01 '16 at 15:19
Ah, you have commas between these objects, as if they are part of *one large JSON object*. Is there a `[` at the very start of the file, and a `]` at the end? — Martijn Pieters, Oct 01 '16 at 15:23
If so, then you have **one huge** JSON object, not loads of small ones. — Martijn Pieters, Oct 01 '16 at 15:24
Otherwise, you'll have to change `decoder.raw_decode(buffer)` to `decoder.raw_decode(buffer.lstrip(',\n'))` to remove any leading commas and newlines before parsing the next object. — Martijn Pieters, Oct 01 '16 at 15:25
Right, when the data streams in there's a comma after each {}. So by adding a [] at the beginning and end of the file (i.e. the beginning of the first file and end of last), it would interpret each {} as a separate JSON? — shishy, Oct 01 '16 at 15:25
Okay, when I did buffer.lstrip as you suggested without adding square brackets, I got 5594 tweets as opposed to the previous 2789. But there should be way more because this is a week-long collection of 1% of data and there's 7GB of tweets in here... — shishy, Oct 01 '16 at 15:29
`{...}` is a valid JSON document. `{...},` is *not*. `[{...}, {...}, {...}]` would be a valid JSON document. So if you added `[` at the start and `]` at the end, you should be able to load the whole file with `json.load()`. — Martijn Pieters, Oct 01 '16 at 15:41
To do some troubleshooting, I counted the number of files in that directory (=2798). This confirms what you said earlier because it was treating each file as one huge JSON object (earlier I was only getting 2798 entries added to my list). Now with the lstrip though I get 5594 which is almost twice as before. I'm thinking that for some reason in each file it's still not able to parse the individual objects. — shishy, Oct 01 '16 at 15:42
I'm amazed you got that many out with this technique on something that doesn't contain individual valid JSON objects, to be frank. — Martijn Pieters, Oct 01 '16 at 15:43
In reply to the comment about json.load, what I initially tried (see: http://stackoverflow.com/questions/39781716/loading-large-twitter-json-data-7gb-into-python) was to merge all tweets into one file and then try to just use json.load on that but it failed due to memory issues. Would you suggest that I remake each file in the [ ] format and then json.load each file separately? It should still be able to parse the individual tweets out that way right? — shishy, Oct 01 '16 at 15:43
Yes, process your data file by file if the total number of tweets is too large to load into memory in one go. — Martijn Pieters, Oct 01 '16 at 15:52
I'll try this then. Thanks for your help! Hopefully I can update with an "it worked!" — shishy, Oct 01 '16 at 15:52
Dear Martijn, after some tinkering I realized the root of the issue. I haven't been able to use json.loads because it threw a ValueError, citing that the first " (before created_at) wasn't there. I realized that for some reason, these quotes were being converted to a unicode left double quotation mark (i.e. to \xe2\x80\x9ccreated_at" instead of "created_at"). I'm not sure how to fix this -- I tried to use .replace but that doesn't work at all. Just thought I'd let you know in case you were curious. I have to find out what in my configuration is causing this issue too... — shishy, Oct 01 '16 at 18:59
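
One detail of the `lstrip` suggestion in the comments may be worth spelling out. If a stripped copy is passed to `raw_decode` while the original buffer is still sliced with the returned index, the slice is off by the number of stripped characters, so the third object in a file starts mid-token and never parses. That would be consistent with the roughly two tweets per file reported above. The following is only a sketch under that assumption: `iter_json_objects` and its `separators` parameter are hypothetical names, and the sample input is made up.

```python
import io
from json import JSONDecoder


def iter_json_objects(file, decoder=JSONDecoder(), buffersize=2048,
                      separators=', \t\r\n'):
    """Yield JSON values from a text stream, skipping commas and whitespace
    between them. Hypothetical helper, not part of the json module."""
    buffer = ''
    while True:
        chunk = file.read(buffersize)
        buffer += chunk
        while True:
            # Strip the buffer itself (not a copy), so the index returned by
            # raw_decode lines up with the string that is sliced below.
            buffer = buffer.lstrip(separators)
            if not buffer:
                break
            try:
                result, index = decoder.raw_decode(buffer)
            except ValueError:
                break  # no complete value at the start yet: read more data
            yield result
            buffer = buffer[index:]
        if not chunk:
            break  # end of file; anything left in buffer could not be parsed


# Made-up miniature input; a tiny buffersize forces objects to span chunks.
sample = '{"id": 1},\r\n{"id": 2},{"id": 3},\n'
objs = list(iter_json_objects(io.StringIO(sample), buffersize=5))
print(len(objs))
```

This assumes the files are opened in text mode and do not start with `[`. Anything left unparsed at end of file is silently dropped here, so counting or logging the leftover buffer would be a sensible addition when hunting for missing tweets. If the files also contain raw line breaks inside string values, as the pasted sample appears to, the default strict decoder rejects those objects; `JSONDecoder(strict=False)` accepts control characters inside strings, though the line break then stays in the decoded value.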
