I have the following MRE code:
```python
from selenium import webdriver
from pathlib import Path
from bs4 import BeautifulSoup as bs
from time import sleep

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--blink-settings=imagesEnabled=false")  # skip downloading images
options.add_argument("--log-level=3")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--ignore-ssl-errors")
options.add_argument("--no-sandbox")

chrome_driver = str(Path("chromedriver/chromedriver"))  # path to the chromedriver binary
driver = webdriver.Chrome(chrome_driver, options=options)

def get_soup(url):
    driver.get(url)
    sleep(0.01)
    html = driver.page_source
    return bs(html, "html.parser")

urls = [
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/2/',
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/3/',
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/4/',
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/5/',
]
soups = [get_soup(url) for url in urls]
```
When I run this and then examine `soups[3]`, it isn't scraping the 4th URL in the `urls` list. It seems to be scraping the 2nd URL in the list instead.

If I increase the `sleep` time to `1`, then it scrapes the correct page.
Why is this happening, and is there a way to mitigate it without adding a significant pause to the code? I have hundreds of thousands of pages to scrape...
I thought about learning how to check for an element on the page before scraping, but these pages are all extremely similar except for the matches. As such, if I searched for, e.g., the table containing the matches, I'd still get a 'hit' in my MRE above even though it's the wrong page. Perhaps there's a way to check that the correct URL has loaded before scraping it, along the lines of the sketch below?
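A minimal sketch of what I have in mind, using Selenium's explicit waits (the 10-second timeout is arbitrary). I'm not sure `current_url` is a reliable signal here, though, since these URLs differ only in the `#` fragment:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_soup_waiting(url):
    driver.get(url)
    # Block (up to 10 s) until the browser reports the requested URL,
    # then parse the rendered HTML as before.
    WebDriverWait(driver, 10).until(EC.url_to_be(url))
    return bs(driver.page_source, "html.parser")
```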
**Update:**

In line with Zein's answer, I tried implementing a `while` loop to check the URL, but I'm still getting the same issue. The updated function is below:
```python
def get_soup(url):
    while driver.current_url != url:
        driver.get(url)
        sleep(0.01)
    html = driver.page_source
    return bs(html, "html.parser")
```
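The next thing I'm considering is waiting for the old content to go stale rather than watching the URL, since the fragment change appears to swap the results in via JavaScript. This is only a sketch under that assumption; the `#tournamentTable` selector is my guess at the element that gets replaced:

```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_soup_stale(url):
    try:
        # Keep a handle on the results table currently in the DOM (if any).
        old_table = driver.find_element(By.CSS_SELECTOR, "#tournamentTable")
    except NoSuchElementException:
        old_table = None  # first navigation: nothing to go stale yet
    driver.get(url)
    if old_table is not None:
        # Wait (up to 10 s) for the old table to be detached from the DOM,
        # i.e. for the page's JavaScript to swap in the new results.
        WebDriverWait(driver, 10).until(EC.staleness_of(old_table))
    return bs(driver.page_source, "html.parser")
```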