
I am writing a Python chatbot with TensorFlow that uses the dump of all publicly available Reddit comments from the past few years found here: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/?st=j9udbxta&sh=69e4fee7. I downloaded the comments through the torrent, and everything seemed to go well. However, when I read one of the JSON files into a Python program, the entire file does not load. Each month's data in 2015 is around 15,000 KB, but only the first ~2,600 lines load, while the actual file has hundreds of thousands of lines. The last row that does load appears to be cut off mid-sentence, like this:

    {"subreddit":"sydney","author_flair_text":null,"id":"cqugtij","gilded":0,"removal_reason":null,"downs":0,"archived":false,"created_utc":"1430439358","link_id":"t3_34e5fd","ups":6,"subreddit_id":"t5_2qkob","name":"t1_cqugtij","score_hidden":false,"author_flair_css_class":null,"parent_id":"t1_cqttsc3","controversiality":0,"score":6,"author":"SilverMeteor9798","body":"As state transport minister almost every press release from Gladys had something in there about how the liberals were \"getting on with the job\" and blaming Labor for something. It wasn't necessarily false, it just got tiresome after a while particular
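To narrow down whether the truncation is in the file on disk or in how Python reads it, a quick check like this can help (a sketch; the path format is assumed from my loading script):

```python
import os

def inspect_tail(path, nbytes=300):
    """Report the on-disk size and the last few bytes of a dump file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Don't seek past the start of files smaller than nbytes.
        f.seek(-min(nbytes, size), os.SEEK_END)
        tail = f.read()
    return size, tail

# e.g. (path assumed from the loading script below):
# size, tail = inspect_tail("Data/reddit_data/2015/RC_2015-05")
```

If the reported size matches the ~15,000 KB I see per month but the tail ends mid-record, that would point to a truncated download rather than a reading problem.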

This is the code I am using to read the JSON file:

    import json

    timeframe = '2015-05'
    # e.g. opens Data/reddit_data/2015/RC_2015-05
    with open("Data/reddit_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
        for row in f:
            # each line of the dump is a single JSON object
            row = json.loads(row)

Here timeframe names the specific JSON file containing the Reddit comments from 05/2015. When I run this code, I get this error:

    json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 368 (char 367)

This makes sense to me because the last line of the loaded JSON file is cut short, but how can I get Python to read the entire file? I am following sentdex's chatbot tutorial on YouTube (https://www.youtube.com/watch?v=dvOnYLDg8_Y), and even when I run his exact code I get the same error. How do I get the entire JSON file loaded so that I can read the hundreds of thousands of comments? I have tried changing the buffering, and I have tried redownloading the comments.
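For reference, here is a more defensive version of the loop I also tried; it parses the file line by line and reports malformed records instead of crashing (a sketch; note that, if I remember correctly, the torrent ships each month as a .bz2 archive, so a file that was never decompressed needs bz2.open rather than open):

```python
import bz2
import json

def iter_comments(path):
    # Assumption: a path ending in .bz2 is a still-compressed archive
    # from the torrent and must be opened with bz2.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report instead of crashing so one bad record
                # doesn't stop the whole month from loading.
                print("skipping line {}: {}".format(lineno, err))

# e.g.:
# for comment in iter_comments("Data/reddit_data/2015/RC_2015-05"):
#     print(comment["body"])
```

With this, a single truncated line at the end of a file would be skipped with a message rather than raising JSONDecodeError, but it of course doesn't explain why the file is short in the first place.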

Mohan Radhakrishnan
  • Welcome to SO! I think your question has already been answered before here: https://stackoverflow.com/a/17326199/7315159, but feel free to clarify if not :) – Niayesh Isky Sep 30 '18 at 04:47
  • I am not sure it fully fixes my issue, for a few reasons. I don't think the problem is simply that the file is too large, because when I load different months of comments, vastly different numbers of rows load. For example, in 2014-07 only 2,100 rows load, while in 2014-05 84,000 rows load. Also, in his tutorial sentdex does not use anything but json, so why would it work on his computer but not mine? Thank you for the tip, though. – Jake Sledge Sep 30 '18 at 15:03

0 Answers