I want to build a sentiment bot based on Reddit comments, so I've started working with the PRAW library for Python. From some initial googling, it seems that getting all of a thread's comments can be somewhat tricky.
The PRAW code itself is fairly simple (based on this documentation page: https://praw.readthedocs.io/en/stable/tutorials/comments.html), but it is also very slow for a large thread.
I also came across this SO post about getting the thread as JSON: Retrieving all comments from a thread on Reddit. The solution, per that question, is to access the response items of kind=more. I put together a solution using direct calls to the API (i.e., no PRAW), but I'm getting inconsistent results for the number of comments returned.
PRAW method:
import praw

# Authenticate with script-type app credentials
reddit = praw.Reddit(
    client_id="<MYKEY>",
    client_secret="<MY_SECRET_KEY>",
    user_agent="USERAGENT",
    check_for_async=False,
)

url = "https://www.reddit.com/r/CryptoCurrency/comments/11rfcjy/daily_general_discussion_march_15_2023_gmt0/"
submission = reddit.submission(url=url)

# Resolve every "load more comments" placeholder; each one costs an API call
submission.comments.replace_more(limit=None)
comments = submission.comments.list()

print("top level comments:", len(submission.comments))
print("total comments:", len(comments))
JSON API method:
import random
import time

import numpy as np
import requests

# Details for getting a token can be found here:
# https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c
TOKEN = get_reddit_token()  # helper from the article above; returns an OAuth token

headers = {"user-agent": "Mozilla/5.0"}
url = "https://www.reddit.com/r/CryptoCurrency/comments/11rfcjy/daily_general_discussion_march_15_2023_gmt0/.json"
req = requests.get(url, headers=headers)
res = req.json()

# res[0] is the submission itself; res[1] holds the comment tree
body = res[1]["data"]["children"]
thread_id = body[-1]["data"]["parent_id"]

# Seed the store with the comments ("t1" items) returned inline
all_comments = {c["data"]["id"]: c for c in body if c["kind"] == "t1"}
comment_ids = [c["data"]["id"] for c in body if c["kind"] == "t1"]
# The trailing "more" item lists ids of comments not returned inline
comment_ids += body[-1]["data"]["children"]

def get_more_comments(more_ids, get_replies=True):
    more_children = ",".join(more_ids)
    more_url = f"https://oauth.reddit.com/api/morechildren/.json?api_type=json&link_id={thread_id}&children={more_children}&sort=top"
    headers = {"user-agent": "USERAGENT", "Authorization": f"bearer {TOKEN}"}
    res = requests.get(more_url, headers=headers)
    comments = res.json()
    comments = comments["json"]["data"]["things"]

    # Split the response into real comments and further "more" placeholders
    t1_comments = [c for c in comments if c["kind"] == "t1"]
    more_comments = [c for c in comments if c["kind"] == "more"]
    print("new comments", len(t1_comments))
    print("more comments", len(more_comments))

    for comment in t1_comments:
        comment_id = comment["data"]["id"]
        all_comments[comment_id] = comment

    if get_replies:
        more_comments = [c["data"]["children"] for c in more_comments]
    else:
        # Only keep "more" items hanging directly off the submission
        more_comments = [c["data"]["children"] for c in more_comments
                         if c["data"]["parent_id"] == thread_id]
    # Flatten the list of id lists
    more_comments = [c for c_list in more_comments for c in c_list]
    return more_comments

for i in range(100):
    print(i)
    # Request up to 100 not-yet-fetched ids (morechildren caps children at 100)
    existing_comments = list(all_comments.keys())
    eligible_comments = np.isin(comment_ids, existing_comments)
    eligible_comments = np.array(comment_ids)[~eligible_comments].tolist()
    more_ids = eligible_comments[:100]
    more_comments = get_more_comments(more_ids)
    comment_ids += more_comments
    comment_ids = list(set(comment_ids))
    random.shuffle(comment_ids)
    time.sleep(1)

print("top level comments:", len([k for k, v in all_comments.items() if v["data"]["depth"] == 0]))
print("total comments:", len(all_comments))
The basic idea of the second method is to get the initial JSON response for the thread (note that the URLs are the same in both examples except that the second has .json appended) and capture the comment ids from any items that have kind=more.
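For anyone unfamiliar with those items: each child in the comment listing is either a full comment (kind=t1) or a placeholder (kind=more) whose children field lists the comment ids that weren't returned inline. A simplified illustration (the ids and counts here are made up):

# Illustrative only -- ids and counts are invented. The "more" item's
# data.children ids are what get fed to /api/morechildren.
example_children = [
    {"kind": "t1", "data": {"id": "jc8abcd", "depth": 0, "body": "..."}},
    {"kind": "more", "data": {"count": 57, "parent_id": "t3_11rfcjy",
                              "children": ["jc8wxyz", "jc8qrst"]}},
]
unfetched_ids = [i for c in example_children if c["kind"] == "more"
                 for i in c["data"]["children"]]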
For this example, I'm iterating an arbitrary number of times (100) to keep pulling new comments, but after a while the loop stops finding any.
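Instead of a fixed iteration count, the loop could run until no unfetched ids remain; a minimal sketch reusing the variables from the script above (with a safety cap, since I can't rule out ids that never resolve):

# Sketch: stop once every known id has been fetched rather than after a
# fixed number of rounds. max_rounds guards against ids that never resolve.
max_rounds = 200
for _ in range(max_rounds):
    pending = [cid for cid in comment_ids if cid not in all_comments]
    if not pending:
        break
    comment_ids += get_more_comments(pending[:100])
    comment_ids = list(set(comment_ids))
    time.sleep(1)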
The JSON method runs quickly even with a sleep between requests, so it would be great to use it if nothing can be done about the speed of PRAW's replace_more, but I want to get all of the comments.
The PRAW method took over 30 minutes to run and returned 1259 top-level comments and 4686 total comments as of the time I ran it. The JSON method returned 1259 top-level comments (good, since it matches PRAW) but only 4511 total comments (fewer than PRAW).
Note: When I started writing this question, I was getting far fewer comments from the JSON method, but adding sort=top to the morechildren URL closed most of the gap (I've updated this in the code above; sort=new also works). Even though the results are much closer now, I'm posting in case anyone can point out why I'm still not getting all of the comments.
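Since the sort value visibly changes what comes back, one speculative idea (untested beyond swapping sorts manually) would be to request the same batch under several sort orders and union the results. This assumes get_more_comments is modified to take a sort parameter that is interpolated into the morechildren URL in place of the hard-coded sort=top:

# Hypothetical variant: assumes get_more_comments(more_ids, sort=...)
# exists, i.e. the sort in the URL is parameterized instead of fixed.
new_ids = set()
for sort_order in ("top", "new", "old"):
    new_ids.update(get_more_comments(more_ids, sort=sort_order))
comment_ids = list(set(comment_ids) | new_ids)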
I'd like to reach 100% completeness if possible, and the question may also help others trying to scrape comments more efficiently.