
I want to build a sentiment bot based on Reddit comments, so I've started working with the Reddit PRAW library for Python. Just googling this topic, it seems that getting all thread comments can be somewhat tricky.

The PRAW code itself is fairly simple (based on this documentation page: https://praw.readthedocs.io/en/stable/tutorials/comments.html), but it is also very slow for a large thread.

I also came across this SO post about getting the thread in JSON: Retrieving all comments from a thread on Reddit. The approach suggested there is to follow up on response items with kind=more. I built a solution using direct calls to the API (i.e., no PRAW), but I'm getting inconsistent results for the number of comments returned.

PRAW method:

import praw

reddit = praw.Reddit(client_id="<MYKEY>",
                     client_secret="<MY_SECRET_KEY>",
                     user_agent="USERAGENT",
                     check_for_async=False)

url = "https://www.reddit.com/r/CryptoCurrency/comments/11rfcjy/daily_general_discussion_march_15_2023_gmt0/"
submission = reddit.submission(url=url)
submission.comments.replace_more(limit=None)
comments = submission.comments.list()
print("top level comments:", submission.comments.__len__())
print("total comments:", len(comments))

JSON API method:

import requests
import time
import random
import numpy as np

# get_reddit_token() is a placeholder; details on getting a token:
# https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c
TOKEN = get_reddit_token()

headers = {"user-agent": "Mozilla/5.0"}
url = "https://www.reddit.com/r/CryptoCurrency/comments/11rfcjy/daily_general_discussion_march_15_2023_gmt0/.json"
req = requests.get(url, headers=headers)
res = req.json()
# res[0] is the submission listing; res[1] holds the comment tree
body = res[1]["data"]["children"]
# the last child is a kind=more node whose parent_id is the t3_ id of the thread
thread_id = body[-1]["data"]["parent_id"]

all_comments = {c["data"]["id"]: c for c in body if c["kind"] == "t1"}
comment_ids = [c["data"]["id"] for c in body if c["kind"] == "t1"]
comment_ids += body[-1]["data"]["children"]

def get_more_comments(more_ids, get_replies=True):
    # the morechildren endpoint accepts up to 100 comma-separated ids per call
    more_children = ",".join(more_ids)
    more_url = f"https://oauth.reddit.com/api/morechildren/.json?api_type=json&link_id={thread_id}&children={more_children}&sort=top"
    
    headers = {'user-agent': 'USERAGENT', 'Authorization': f"bearer {TOKEN}"}
    res = requests.get(more_url, headers=headers) 
    comments = res.json()
    comments = comments["json"]["data"]["things"]
    t1_comments = [c for c in comments if c["kind"] == "t1"]
    more_comments = [c for c in comments if c["kind"] == "more"]
    print("new comments", len(t1_comments))
    print("more comments", len(more_comments))
    
    for comment in t1_comments:
        comment_id = comment["data"]["id"]
        all_comments[comment_id] = comment
    
    if get_replies:
        more_comments = [c["data"]["children"] for c in more_comments]
    else:
        more_comments = [c["data"]["children"] for c in more_comments if c["data"]["parent_id"] == thread_id]
    
    # flatten list of lists
    more_comments = [c for c_list in more_comments for c in c_list]
    
    return more_comments
    
    
    
for i in range(100):
    print(i)
    existing_comments = list(all_comments.keys())
    eligible_comments = np.isin(comment_ids, existing_comments)
    eligible_comments = np.array(comment_ids)[~eligible_comments].tolist()
    more_ids = eligible_comments[:100]
    more_comments = get_more_comments(more_ids)
    comment_ids += more_comments
    comment_ids = list(set(comment_ids))
    random.shuffle(comment_ids)
    
    time.sleep(1)

print("top level comments:", len([k for k,v in all_comments.items() if v["data"]["depth"] == 0]))
print("total comments:", len(all_comments.keys()))

The basic idea of the second method is to get the initial JSON response for the thread (note that the URLs are the same in both examples, except the second has .json appended) and capture the comment ids of any items with kind=more.
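
For reference, a kind=more item in that listing looks roughly like this (the field values are illustrative, not from a real response):

{
    "kind": "more",
    "data": {
        "count": 120,                 # number of comments hidden behind this node
        "parent_id": "t3_11rfcjy",    # t3_ prefix = the submission itself
        "depth": 0,
        "children": ["jc8aaaa", "jc8bbbb", "jc8cccc"],  # ids to pass to morechildren
    },
}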

For this example, I'm iterating an arbitrary number of times to try pulling new comments, but after a while it stops finding any. This method runs quickly even with a sleep between requests, so I'd love to use it if nothing can be done about the speed of PRAW's replace_more, but I want to get all comments.
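
As an aside, here's a minimal sketch of the same loop with a termination condition instead of a fixed iteration count, assuming the all_comments, comment_ids, and get_more_comments definitions above:

# loop until every known id has been fetched, with a safety cap since
# deleted comments may never be returned and would otherwise spin forever
for _ in range(200):
    remaining = [cid for cid in comment_ids if cid not in all_comments]
    if not remaining:
        break
    more_comments = get_more_comments(remaining[:100])
    comment_ids = list(set(comment_ids + more_comments))
    time.sleep(1)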

The PRAW method took over 30 minutes to run and returned 1259 top comments and 4686 total comments as of the time I ran this. The JSON method returned 1259 top comments (that's good since it matches PRAW) and 4511 total comments (fewer than PRAW).

Note: When I started writing this question, the JSON method was returning far fewer comments, but adding sort=top to the morechildren URL closed most of the gap (I've updated this in the code above; sort=new also works). Even though the results are much closer now, I'm posting anyway in case anyone can point out why I'm still not getting all of the comments.

I'd like to reach 100% completeness if possible, and hopefully this question helps others trying to scrape comments more efficiently.


1 Answer


I was missing something fairly obvious: the initial JSON response (i.e., the full thread page in JSON, not the morechildren requests) already contains nested replies and their comment ids.

This solution doesn't get exactly to 100% but it's pretty close. After retrying the same thread, I got 4701 total comments with PRAW and 4671 comments with the JSON method.

Trying an older thread, I got 2366 for PRAW and 2339 for JSON: https://www.reddit.com/r/CryptoCurrency/comments/wr48rj/daily_general_discussion_august_18_2022_gmt0/

The solution below is adapted from: https://stackoverflow.com/a/66189132/6182971

The following lines in the OP:

all_comments = {c["data"]["id"]: c for c in body if c["kind"] == "t1"}
comment_ids = [c["data"]["id"] for c in body if c["kind"] == "t1"]
comment_ids += body[-1]["data"]["children"]

Become:

def traverse_comments(comments):
    # walk the comment forest depth-first, yielding every node
    # (both kind=t1 comments and kind=more placeholders)
    for c in comments:
        yield c
        # t1 comments carry replies as a nested listing dict; "more" nodes
        # and reply-less comments have "" (or no "replies" key) instead
        if isinstance(c["data"].get("replies"), dict):
            children = c["data"]["replies"]["data"]["children"]
            yield from traverse_comments(children)


comments = [c for c in traverse_comments(body) if c["kind"] == "t1"]

all_comments = {c["data"]["id"]: c for c in comments}
comment_ids = list(all_comments.keys())
# also collect the ids hidden behind every kind=more node at any depth
# (this covers the top-level more node, body[-1], as well)
comment_ids += [child_id
                for c in traverse_comments(body) if c["kind"] == "more"
                for child_id in c["data"]["children"]]
comment_ids = list(set(comment_ids))
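
The morechildren loop from the question can then run unchanged on this seed data. As a quick sanity check of what the initial traversal captured (using the names defined above):

# comments in the thread JSON carry a "depth" field, so top-level
# comments can be counted directly from the seeded dictionary
top_level = sum(1 for c in all_comments.values() if c["data"]["depth"] == 0)
print("top level comments seeded:", top_level)
print("total comments seeded:", len(all_comments))
print("ids still to fetch:", len(set(comment_ids) - set(all_comments)))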