
I'm trying to scrape the following URL for all football (soccer) matches for that day: https://www.soccerstats.com/matches.asp?matchday=2&daym=tomorrow

My code used to work, but the website has since changed so that you now need to click an "I agree to cookies" button before the page loads. This is now breaking my code. Are there any solutions to this?

Any help is much appreciated.

I've tried looking at the text output from bs4, and it's clear the site has not loaded; instead, the "I agree to cookies" text can be seen in the output, which means the request is not getting past this stage.

import re

from bs4 import BeautifulSoup
import requests

url = "https://www.soccerstats.com/matches.asp?matchday=2"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the href of every match button found in the raw HTML
all_matches = re.findall(r"""<a class='button' style='background-color:#AAAAAA;font-color=white;' href='(.*?)'>""", data)

The output should list the individual match URLs.

FrostyX

1 Answer


When you click on "I agree to cookies", the website sends a cookie to your browser that basically tells the website "This user has agreed to cookies." You can inspect this cookie in something like Chrome's DevTools: open the Application tab, expand "Cookies" on the left, and select the website you're on.

Once you've done that, click "I agree to cookies" and see what cookies were added to your browser. On the website I'm looking at, one of the added cookies is called __hs_opt_out with a value of no. Then, you can simply add that cookie to your request:

r = requests.get(url, cookies={'__hs_opt_out': 'no'})

Or, even better:

s = requests.Session()
s.cookies.update({'__hs_opt_out': 'no'})
s.get(url)  # Automatically uses the session cookies

# Some more code...

s.get(other_url)  # Remembers the cookie from before
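
To tie this back to the question, here is a minimal sketch that sends the consent cookie, checks whether the consent page is still being served, and then pulls out the match links. The helper names (looks_like_consent_page, extract_match_links, fetch_match_links) are hypothetical, the marker text "I agree to cookies" is taken from the question, and the cookie name and value are just the ones I saw; the site may well use a different cookie for you. The regex is a loosened version of the one in the question.

```python
import re

# Assumption: this text only appears while the consent wall is shown.
CONSENT_MARKER = "I agree to cookies"


def looks_like_consent_page(html):
    """Return True if the response still contains the consent prompt."""
    return CONSENT_MARKER.lower() in html.lower()


def extract_match_links(html):
    """Return the href of every <a class='button' ...> link in the HTML."""
    return re.findall(r"<a class='button'[^>]*?href='(.*?)'", html)


def fetch_match_links(url, consent_cookies):
    """Fetch url with the given cookies and return the match links."""
    import requests  # third-party; imported here so the helpers above work without it

    with requests.Session() as s:
        s.cookies.update(consent_cookies)
        r = s.get(url, timeout=10)
        r.raise_for_status()
        if looks_like_consent_page(r.text):
            # The cookie was not accepted; re-check the name/value in DevTools.
            raise RuntimeError("Still seeing the consent page")
        return extract_match_links(r.text)
```

You would call it as fetch_match_links(url, {'__hs_opt_out': 'no'}). If it raises, the cookie you copied is probably not the one the site checks, so compare the cookie list before and after clicking the button again.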
Michael Kolber