
I've put the entire code into the question. Thank you to everyone who has replied; this issue is really annoying either way, so any help is appreciated!

Context: This code is meant to open the top Reddit post of the day/week and screenshot it; once that's done, it goes to the comments and screenshots the top comments of said post. The former works, but the latter does not.

import time,utils,string
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from utils import config

def scrape(post_url):
    bot = utils.create_bot(headless=True)
    data = {}

    try:
        # Load cookies to prevent cookie overlay & other issues
        bot.get('https://www.reddit.com')
        for cookie in config['reddit_cookies'].split('; '):
            cookie_data = cookie.split('=')
            bot.add_cookie({'name': cookie_data[0], 'value': cookie_data[1], 'domain': 'reddit.com'})
        bot.get(post_url)

        # Fetching the post itself, text & screenshot
        post = WebDriverWait(bot, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.Post')))
        post_text = post.find_element(By.CSS_SELECTOR, 'h1').text
        data['post'] = post_text
        post.screenshot('output/post.png')

        # Let comments load
        bot.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)

        # Fetching comments & top level comment determinator
        comments = WebDriverWait(bot, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div[id^=t1_][tabindex]')))
        allowed_style = comments[0].get_attribute("style")

        # Filter for top only comments
        NUMBER_OF_COMMENTS = 10
        comments = [comment for comment in comments if comment.get_attribute("style") == allowed_style][:NUMBER_OF_COMMENTS]

        print(' Scraping comments...', end="", flush=True)
        # Save time & resources by only fetching X content
        for i in range(len(comments)):
            try:
                print('.', end="", flush=True)
                # Filter out locked comments (AutoMod)
                try:
                    comments[i].find_elements(By.CSS_SELECTOR, '.icon.icon-lock_fill')
                    continue
                except:
                    pass

                # Scrolling to the comment ensures that the profile picture loads
                # Credits: https://stackoverflow.com/a/57630350
                desired_y = (comments[i].size['height'] / 2) + comments[i].location['y']
                window_h = bot.execute_script('return window.innerHeight')
                window_y = bot.execute_script('return window.pageYOffset')
                current_y = (window_h / 2) + window_y
                scroll_y_by = desired_y - current_y

                bot.execute_script("window.scrollBy(0, arguments[0]);", scroll_y_by)
                time.sleep(0.2)

                # Getting comment into string
                text = "\n".join([element.text for element in comments[i].find_elements_by_css_selector('.RichTextJSON-root')])

                # Screenshot & save text
                comments[i].screenshot(f'output/{i}.png')
                data[str(i)] = ''.join(filter(lambda c: c in string.printable, text))
            except Exception as e:
                if config['debug']:
                    raise e
                pass

        if bot.session_id:
            bot.quit()
        return data
    except Exception as e:
        if bot.session_id:
            bot.quit()
        if config['debug']:
            raise e
        return False
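
For anyone trying to reproduce this, here is a minimal way to run it (a sketch; it assumes utils.create_bot and the config entries used above, and the URL is the example post mentioned in the comments below):

if __name__ == '__main__':
    # Example post URL from the comments below; any Reddit post URL should work
    url = 'https://www.reddit.com/r/AskReddit/comments/wclubp/what_addiction_is_seen_as_completely_normal_by/'
    print(scrape(url))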
Multiverse
  • Can you provide a url, to be able to recreate your issue? – Barry the Platipus Aug 02 '22 at 22:23
  • @platipus_on_fire It runs against the top Reddit posts in a subreddit that week – Multiverse Aug 02 '22 at 22:26
  • @platipus_on_fire For example, I ran the bot and the URL was https://www.reddit.com/r/AskReddit/comments/wclubp/what_addiction_is_seen_as_completely_normal_by/ The post itself was screenshotted, but the comments were not – Multiverse Aug 02 '22 at 22:27
  • Is this the sort of results you are looking for? https://ibb.co/mCmpdR5 https://ibb.co/G7282fD https://ibb.co/31KQYWh https://ibb.co/HnCqZKm – Barry the Platipus Aug 02 '22 at 23:46

2 Answers


It's good practice to wait for elements to load on the page (i.e. to be present, or clickable if you need to interact with them) before locating them. For example, try locating the comments with:

comments = WebDriverWait(bot, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div[id^=t1_][tabindex]')))
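
If you need the comments to actually be clickable rather than just present, a rough equivalent (an untested sketch using the same locator) is to wait for the first one to become clickable and then collect them all:

# Wait until the first matching comment is visible and enabled, then grab all of them
WebDriverWait(bot, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[id^=t1_][tabindex]')))
comments = bot.find_elements(By.CSS_SELECTOR, 'div[id^=t1_][tabindex]')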

You will also need to import:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

The following code will print out the top 5 comments (using waits and a correct locator):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://www.reddit.com/r/AskReddit/comments/wclubp/what_addiction_is_seen_as_completely_normal_by/'


browser.get(url)

elems = WebDriverWait(browser,10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='post-comment-header']/parent::div")))
for e in elems[:5]:
    print(e.text)
    print('______________')

Output in the terminal:

level 1
Ill_Animator_4437
·
2 days ago
infinite scrolling in apps
2.4k
Reply
Share
Report
Save
Follow
______________
level 2
AnOperative
·
2 days ago
Why I use reddit on the website not the app, I can tell my self I'm having a 3 page break then I go back to whatever i was doin
37
Reply
Share
Report
Save
Follow
______________
Barry the Platipus
  • I tried that and it still didn't scrape the comments; it skipped them as usual – Multiverse Aug 02 '22 at 22:38
  • I'm not sure about your xpath; I didn't test it. Did you also wait for the comments in this line `comments[i].find_elements(By.CSS_SELECTOR, '.icon.icon-lock_fill')`? – Barry the Platipus Aug 02 '22 at 22:40
  • Updated my response with a tangible example. – Barry the Platipus Aug 02 '22 at 22:54
  • I appreciate the example, and whilst this does work, the code was designed to screenshot, not print. I will post the entire code in the original question – Multiverse Aug 02 '22 at 23:04
  • So then screenshot it after you print it in the terminal: give it a time.sleep(1) for good measure, take the screenshot, then move on to the next one. The crux of the matter is waiting for the elements and using a correct locator. – Barry the Platipus Aug 02 '22 at 23:05
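
A minimal sketch of that suggestion, adapting the elems loop from the answer above to save screenshots (the output/ directory and file names here are assumptions, and the directory must already exist):

import time

for i, e in enumerate(elems[:5]):
    print(e.text)
    time.sleep(1)  # give the element a moment to settle before capturing
    e.screenshot(f'output/comment_{i}.png')  # hypothetical output path
    print('______________')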

The code was fixed by removing the block that filters out locked comments. find_elements never raises when nothing matches (it returns an empty list), so the try/except there ran continue for every comment and all of them were skipped.
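
If you would rather keep the filter than delete it, a working check would look something like this sketch (reusing the selector from the question):

# find_elements returns an empty list, not an exception, when nothing matches,
# so test the list's truthiness instead of relying on try/except
if comments[i].find_elements(By.CSS_SELECTOR, '.icon.icon-lock_fill'):
    continue  # skip locked comments (e.g. AutoMod stickies)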

Multiverse