Scraping a webpage using Python (beautiful soup) that requires "I agree to cookies" button being clicked?

Question

I'm trying to scrape the following URL for all football (soccer) matches for that day: https://www.soccerstats.com/matches.asp?matchday=2&daym=tomorrow

My code used to work, but the website has since changed so that you now need to click an "I agree to cookies" button before the site loads the page. This is now breaking my code. Are there any solutions to this?

Any help is much appreciated.

I've looked at the text output from bs4 and it's clear the page has not loaded. Instead, the "I agree to cookies" text appears in the output, which means the request is not getting past the consent stage.

import re

import requests
from bs4 import BeautifulSoup

url = "https://www.soccerstats.com/matches.asp?matchday=2"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the href of every grey "button" link (one per match) from the raw HTML
all_matches = re.findall(r"""<a class='button' style='background-color:#AAAAAA;font-color=white;' href='(.*?)'>""", data)

The output should list the individual match URLs.

Check what cookie is added to your browser when you click "I agree" and then add the cookie to your `requests.get()` call. — Michael Kolber, Jul 23 '19 at 20:01
No problem, sorry for not giving a full answer, I'll add one now for posterity. Feel free to accept it or add your own detailing how you did it. — Michael Kolber, Jul 24 '19 at 19:55

score 5 · Accepted Answer · answered Jul 24 '19 at 20:05

When you click on "I agree to cookies", the website sends a cookie to your browser that basically tells the website "This user has agreed to cookies." You can capture this cookie in something like Chrome's DevTools by opening up the Application tab and clicking "Cookies" on the left, and navigating to the website you're on.

Once you've done that, click "I agree to cookies" and see what cookies were added to your browser. On the website I'm looking at, one of the added cookies is called __hs_opt_out with a value of no. Then, you can simply add that cookie to your request:

r = requests.get(url, cookies={'__hs_opt_out': 'no'})

Or, even better:

s = requests.Session()
s.cookies.update({'__hs_opt_out': 'no'})
s.get(url)  # Automatically uses the session cookies

# Some more code...

s.get(other_url)  # Remembers the cookie from before
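If you want to confirm that the session really attaches the cookie before you hit the site, one option is to build the request without sending it and look at its headers. This is only a sketch: the helper name `prepare_with_consent` is made up for illustration, and the cookie name and value are the ones I saw, so yours may differ.

```python
import requests

URL = "https://www.soccerstats.com/matches.asp?matchday=2"


def prepare_with_consent(session, url):
    """Build (but do not send) a GET request that carries the session's cookies.

    Hypothetical helper for illustration; no network traffic happens here.
    """
    return session.prepare_request(requests.Request("GET", url))


session = requests.Session()
# Assumption: this is the consent cookie your browser received after clicking
# "I agree". Check DevTools for the actual name and value on your site.
session.cookies.update({"__hs_opt_out": "no"})

prepared = prepare_with_consent(session, URL)
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

If the printed header contains your cookie, `session.get(URL)` will send the same header, and you can go back to parsing `r.text` as before.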
