I'm new to web scraping, and I have a problem when I try to scrape posts from Reddit: I only get the top 3 or 4 results, as in this image.

The code that I use is:

import time
from selenium import webdriver
# the driver below is Chrome, so import the Chrome service and driver manager
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import requests
from bs4 import BeautifulSoup


driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://www.reddit.com/r/football/")
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

post_elements = soup.find_all('shreddit-post', class_='block cursor-pointer relative bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible')

post_data_list = []

for post_elem in post_elements:
    post_data = {} 
    
    post_data['post_title'] = post_elem['post-title']
    
    post_data['permalink'] = post_elem['permalink']
    
    post_data['author'] = post_elem['author']
    
    post_data['timestamp'] = post_elem.find('time')['datetime']
    
    post_data['score'] = post_elem['score']
    
    post_data['domain'] = post_elem['domain']
    
    post_data_list.append(post_data)

reddit_df = pd.DataFrame(post_data_list)
reddit_df # see the result in picture. 

Is there any way to get data from the rest of the posts on Reddit? (There are more than 3 posts on the page.)

I tried exporting the result to CSV, but it still contains only three rows.

After running the code above, I was expecting a bigger table with data from more posts.

If Reddit limits scraping, is there a way around that limit?

Alex D
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Aug 12 '23 at 11:06

1 Answer

To get data from Reddit, I suggest looking at the JSON API they provide (add .json to the end of the URL):

import requests
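from datetime import datetime, timezone


url = "https://old.reddit.com/r/football/.json"  # <-- note the .json at the end

# send a browser-like User-Agent; without it Reddit throttles the requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"
}

data = requests.get(url, headers=headers).json()

# every child of the listing is one post
for c in data["data"]["children"]:
    # created_utc is a Unix timestamp in UTC
    t = datetime.fromtimestamp(c["data"]["created_utc"], tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )

    # title (cut to 60 characters), upvotes, creation time
    print(f'{c["data"]["title"][:60]:<60} {c["data"]["ups"]:^5} {t}')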

Prints:

r/Football Random Discussion Thread                            5   2023-08-01 21:00:30
Could have been a Pro footballer?                             59   2023-08-12 08:26:48
Letter from Man Utd's Female Fans Against Greenwood           802  2023-08-11 14:30:36
[Official] Harry Kane joins Bayern Munich                     16   2023-08-12 08:16:42
If Neymar goes back to Barca and wins them the UCL, howhch w  21   2023-08-12 04:10:32
Will Kroos be ever discussed as a legendary player?           68   2023-08-11 20:33:52
Best and worst World Cup songs?                                5   2023-08-12 10:07:16
Do you think Liverpool paid to much for moises caicedo.       159  2023-08-11 10:40:34
Lewandowski vs squarez. Who is the better striker of the las  50   2023-08-11 16:17:22
Brazil vs Argentina youth development                          3   2023-08-12 07:25:37
Why are we (Sweden) so good at the ladies' football, but muc  76   2023-08-11 12:29:55
What are some of the most one sided rivalries?                 7   2023-08-11 23:41:44
My Dream Team according to me (comment your suggestions)(12    0   2023-08-12 08:04:02
Exciting News for Soccer Fans in the United States!            0   2023-08-12 11:09:08
Best player to have played in the English Prem?                5   2023-08-11 21:51:15
What's the worst (or least good) club you believe could win    0   2023-08-12 05:11:53
What happened with Julian Nagelssman?                         194  2023-08-10 23:38:47
Real Madrid fans, which player from current/past Barca do yo  35   2023-08-11 09:13:00
The next era of the goat debate                                0   2023-08-12 02:52:22
Lionel Messi at Inter Miami CF (Match 5)                       1   2023-08-12 02:34:24
I accidentally learned how to knuckleball a football as a be   0   2023-08-12 01:39:38
Another act of racism against black Brazilians in South Amer  12   2023-08-11 12:02:54
Why do people think players need to be good at everything to   0   2023-08-12 01:26:07
Haaland tap in merchant my ass                                 2   2023-08-11 19:52:32
Do you guys think Klaksvik has a chance at qualifying for th   0   2023-08-11 22:46:22
Is this pair of football Boots good? I want good shooting an   1   2023-08-11 22:34:14
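
The output above is a single page of the listing. As a minimal sketch of how more posts might be collected, the listing's "after" token can be passed back as a query parameter to request the next page. The helper names (fetch_listing, iter_posts, to_row), the limit of 100 per page, and the use of the standard library's urllib instead of requests are assumptions made for illustration; the column names are the ones from the question.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Same browser-like User-Agent as in the snippet above.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"


def fetch_listing(url, params):
    """Fetch one page of a .json listing (hypothetical helper, does a network request)."""
    full_url = f"{url}?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(full_url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def iter_posts(url, max_pages=3, fetch=fetch_listing):
    """Yield post dicts page by page, following the listing's "after" token."""
    after = None
    for _ in range(max_pages):
        params = {"limit": 100}  # assumption: ask for up to 100 posts per page
        if after:
            params["after"] = after
        listing = fetch(url, params)["data"]
        for child in listing["children"]:
            yield child["data"]
        after = listing.get("after")
        if not after:  # no token means there is no further page
            break


def to_row(post):
    """Map one post dict to the columns used in the question."""
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return {
        "post_title": post["title"],
        "permalink": post["permalink"],
        "author": post["author"],
        "timestamp": created.strftime("%Y-%m-%d %H:%M:%S"),
        "score": post["score"],
        "domain": post["domain"],
    }
```

With these helpers, something like rows = [to_row(p) for p in iter_posts(url)] should give a list of dicts that can go straight into pd.DataFrame(rows), as in the question. The fetch argument is there only so the paging logic can be tried without touching the network.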
Andrej Kesely
  • Can you please explain how you decide when to add a header like `User-Agent` to the request? –  Aug 12 '23 at 11:34
  • @rendezvous Because without it Reddit will throttle the traffic after a few requests and not send any data. – Andrej Kesely Aug 12 '23 at 11:35
  • So does that mean I need to know each website by name (Reddit, ...) to decide whether to include the header? –  Aug 12 '23 at 11:36
  • @rendezvous Every website handles traffic differently, so you have to tailor the code to each of them. – Andrej Kesely Aug 12 '23 at 11:37
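
To make the point about throttling concrete, here is a rough sketch of sending an explicit User-Agent and backing off when a site refuses a request. It assumes the site signals throttling with HTTP status 429, which is common but not something the thread confirms for Reddit; build_request and open_with_retry are hypothetical helper names, and the standard library's urllib is used here instead of requests.

```python
import time
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def open_with_retry(request, opener=urllib.request.urlopen,
                    retries=3, delay=2.0, sleep=time.sleep):
    """Open the request, waiting longer after each HTTP 429 response.

    Assumption: throttling shows up as status 429. Any other HTTP error,
    or a 429 on the last attempt, is re-raised to the caller.
    """
    for attempt in range(retries):
        try:
            return opener(request)
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == retries - 1:
                raise
            sleep(delay * (2 ** attempt))  # 2 s, 4 s, ... with the defaults
```

Whether a given site needs the header, a delay, or both is something to find out by testing against that site, as the last comment says.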