I am writing a Python chatbot with TensorFlow that uses the dump of all publicly available Reddit comments found here: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/?st=j9udbxta&sh=69e4fee7. I downloaded the comments through the torrent, and everything seemed to go well. However, when I read the JSON file into a Python program, the entire file does not seem to load. Each month's data in 2015 is around 15,000 KB, but Python only reads the first ~2,600 lines, while the actual file has hundreds of thousands of lines. The last line that does load appears to be cut off mid-sentence, like this:
{"subreddit":"sydney","author_flair_text":null,"id":"cqugtij","gilded":0,"removal_reason":null,"downs":0,"archived":false,"created_utc":"1430439358","link_id":"t3_34e5fd","ups":6,"subreddit_id":"t5_2qkob","name":"t1_cqugtij","score_hidden":false,"author_flair_css_class":null,"parent_id":"t1_cqttsc3","controversiality":0,"score":6,"author":"SilverMeteor9798","body":"As state transport minister almost every press release from Gladys had something in there about how the liberals were \"getting on with the job\" and blaming Labor for something. It wasn't necessarily false, it just got tiresome after a while particular
This is the code I am using to read the JSON file:
import json

timeframe = '2015-05'
with open("Data/reddit_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    for row in f:
        row = json.loads(row)
Here timeframe identifies the specific JSON file containing the Reddit comments from 05/2015. When I run this code, I get this error:
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 368 (char 367)
This makes sense to me, because the last loaded line of the JSON file is cut short, but how can I get Python to read the entire file? I am following sentdex's chatbot tutorial on YouTube (https://www.youtube.com/watch?v=dvOnYLDg8_Y), and even when I run his exact code I get the same error. How do I load the entire JSON file so that I can read the hundreds of thousands of comments? I have tried changing the buffering, and I have tried redownloading the comments.
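For reference, I can reproduce the same kind of error with a deliberately truncated line, which is why I suspect the data itself is cut off rather than my parsing code. This is a self-contained illustration using made-up lines, not the real dump:

```python
import json

# Illustrative only: a hand-made JSON line and a truncated copy of it,
# simulating the cut-off last line of the dump (not the real Reddit data).
good_line = '{"id": "cqugtij", "score": 6, "body": "example comment"}'
truncated_line = good_line[:40]  # cut off inside the "body" string

def safe_load(line):
    """Return the decoded dict, or None if the line is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

print(safe_load(good_line))       # decodes normally
print(safe_load(truncated_line))  # None: json.loads raises "Unterminated string"
```

Skipping bad lines this way would let the loop finish, but it would not recover the missing comments, so it only works around the symptom.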