
I have the following MRE code:

from selenium import webdriver
from pathlib import Path
from bs4 import BeautifulSoup as bs
from time import sleep

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--blink-settings=imagesEnabled=false")
options.add_argument("--log-level=3")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--ignore-ssl-errors")
options.add_argument("--no-sandbox")
chrome_driver = str(Path("chromedriver/chromedriver/"))
driver = webdriver.Chrome(chrome_driver, options=options)


def get_soup(url):
    driver.get(url)
    sleep(0.01)
    html = driver.page_source
    return bs(html, "html.parser")


urls = [
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/2/',
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/3/',
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/4/',
    'https://www.oddsportal.com/tennis/united-kingdom/atp-wimbledon-2010/results/#/page/5/',
]
soups = [get_soup(url) for url in urls]

When I run this and then examine soups[3], it isn't scraping the 4th URL in the urls list. It seems to be scraping the 2nd URL in the list instead.

If I increase the sleep time to 1 second, it scrapes the correct page.

Why is this happening, and is there a way to mitigate it without adding a significant pause to the code? I have hundreds of thousands of pages to scrape...

I thought about learning how to check for an element on the page before scraping, but these pages are all extremely similar except for the matches. As such, if I were to search for, e.g., the table containing the matches, I'd still get a 'hit' in my MRE above even though it's the wrong page. Perhaps there's a way to check that the correct URL has loaded before scraping it?


Update:

In line with Zein's answer, I tried implementing a while loop to check the URL, but I'm still getting the same issue. The updated function is below:

def get_soup(url):
    while driver.current_url != url:
        driver.get(url)
    sleep(0.01)
    html = driver.page_source
    return bs(html, "html.parser")
Jossy
  • If you want to save time, why are you using Selenium? Use requests instead. – Parolla Dec 14 '20 at 20:27
  • Hey. It was a while ago that I made the decision to go with Selenium. I think it was something to do with the OddsPortal webpages having some funky stuff on them that requests couldn't handle. – Jossy Dec 14 '20 at 20:29
  • This is what WebDriverWaits are designed for. What you have here is the same URL with different values after the hash (which the browser will process as a bookmark, so no page load will occur). The JavaScript will update the DOM. Use WebDriverWaits with expected conditions to wait until the elements you want have been updated (see the sketch after these comments). – pcalkins Dec 14 '20 at 20:52
  • @pcalkins - hey and thanks. I mentioned this in my question as I wasn't sure what elements to test for. The pages are near identical apart from the matches and I don't know what the matches are going to be before I scrape the page. I realise you're not familiar with the pages so me asking what you'd recommend is going to be challenging! – Jossy Dec 14 '20 at 21:03
  • In this case I think you'll want to check for changes to the pagination buttons... the current page's button shouldn't be active. – pcalkins Dec 14 '20 at 21:10
  • @pcalkins - thanks. Afraid I'm very much a newb in this area. Couple of questions - 1. Why should I want it to inactive - does active not mean I'm on the right page? 2. How would I test if the current page's button is active or not? Looking through the html I can see this line: `3` - presumably this means page 3 IS active? – Jossy Dec 14 '20 at 21:34
  • I just mean clicking it probably doesn't do anything. The other links may have anchor tags. It may also have a different style... just use whatever is different in the pagination links for this particular site. – pcalkins Dec 14 '20 at 22:50
  • @pcalkins - I'm really sorry but I'm not sure what you're recommending. The current limit of my web scraping ability is reading HTML into Beautiful Soup. Is there any way you could jot an answer down or point me towards a simple reference? – Jossy Dec 14 '20 at 23:45
  • If you don't want to use the WebDriverWait, just put in a sleep of a couple seconds. Just something that accounts for the time the site takes to update. (0.01 is much too short) – pcalkins Dec 15 '20 at 00:23
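
A minimal sketch of the WebDriverWait approach pcalkins describes, reusing the driver and bs from the question's MRE. The ".pagination .active-page" selector is a hypothetical placeholder - inspect the site's HTML for the real pagination markup:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def get_soup(url):
    page_number = url.rstrip("/").split("/")[-1]  # e.g. "2" from ".../#/page/2/"
    driver.get(url)
    # Hash-only navigation doesn't trigger a full page load, so wait until the
    # site's JavaScript has marked the expected page as active in the
    # pagination bar. The ".active-page" class is an assumption.
    WebDriverWait(driver, 10).until(
        lambda d: d.find_element(By.CSS_SELECTOR, ".pagination .active-page").text == page_number
    )
    return bs(driver.page_source, "html.parser")

WebDriverWait polls the lambda until it returns True or the 10-second timeout expires, and it ignores NoSuchElementException by default, so the check is safe while the DOM is still updating.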

1 Answer


You can try comparing the current URL against the URL from the list, to make sure you are on the correct page, by adding this check to the loop:

if driver.current_url == url:
    continue  # proceed with the rest of your program

Edit: If you are concerned about the time, you can use the threading module to open the URLs separately, so you aren't waiting for a previous one to load.
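
A rough sketch of that idea using concurrent.futures (a thin wrapper around threads), assuming one driver per thread, since a single WebDriver instance is not safe to share across threads. chrome_driver, options, urls, bs and sleep are the names from the question's MRE:

from concurrent.futures import ThreadPoolExecutor


def scrape(url):
    # One driver per thread: a WebDriver instance is not thread-safe.
    driver = webdriver.Chrome(chrome_driver, options=options)
    try:
        driver.get(url)
        sleep(1)  # or a WebDriverWait on a page-specific element
        return bs(driver.page_source, "html.parser")
    finally:
        driver.quit()


with ThreadPoolExecutor(max_workers=4) as pool:
    soups = list(pool.map(scrape, urls))

Note that each call pays the cost of launching its own Chrome instance, so this only saves time when many URLs are scraped per worker.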

K. B.