
Background

I've been building a web scraper with Python 3.8 and Selenium. On both my systems (a 2012 MacBook Pro and a fairly beefy new PC build), it scrapes three similarly structured websites in about 2 hours. For each site, I'm scraping data from individual product listings using the Firefox WebDriver. Each listing from those sites takes about 3-4 seconds on average, and never more than 5 seconds.

Problem

I recently sent it to my partner so he could test it out on his system (not sure on exact model, but a new mid-tier laptop). For the first two websites, it works smoothly and very comparably to my PC. However, on the third website, about 5% of the listings will get stuck for anywhere from 10 seconds to indefinitely. The following two sections of code appear to be where we're getting held up:

from selenium.common.exceptions import NoSuchElementException

try:
    item = browser.find_element_by_xpath("//div[@class='outer class']/div[@class='inner class']").text
except NoSuchElementException:
    # Listing has no such element; record an empty value and move on
    item = ""

and the following, which is in a loop of all URLs to be visited:

from selenium.common.exceptions import TimeoutException

try:
    browser.get(url)
except TimeoutException:
    # Skip this URL and move on to the next one in the loop
    continue
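One detail worth checking here: browser.get() only raises TimeoutException if a page-load timeout has actually been set; without one, Firefox can block on a slow page more or less indefinitely, which matches the "stuck" symptom. A minimal sketch (visit_with_timeout is a hypothetical helper name, and 30 seconds is an assumed value to tune per site):

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Stand-in so the sketch can be read/run without Selenium installed
    class TimeoutException(Exception):
        pass


def visit_with_timeout(browser, url, load_timeout=30):
    """Visit url; return True if it loads, False on timeout.

    load_timeout (seconds) is an assumption; tune it for the site.
    """
    # Without this, browser.get() may block far longer than expected
    # and the except branch below never fires
    browser.set_page_load_timeout(load_timeout)
    try:
        browser.get(url)
        return True
    except TimeoutException:
        return False
```

In the loop, `if not visit_with_timeout(browser, url): continue` keeps the original skip-on-timeout behaviour.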

What I've Considered

I'm new to all of this, so I'm not sure what the possible errors are. Here's what I've considered:

His laptop is subpar: Doesn't seem to be the problem. It's definitely an improvement over my Macbook Pro, which works fine. Also, 2 out of 3 websites work perfectly for him.

His internet connection is subpar: Could be an issue, but is comparing Ookla results enough to determine this? And why would this only make a difference on one website?

His browser/Python is out of date: I checked. They're not.

He has too many processes running on his PC: We did a reset and closed all unnecessary programs. Didn't seem to make a difference.

My code is subpar: This is entirely possible/probable.

Captcha issues: On my side, I haven't seen a captcha. But since it's getting hung up on either 1) loading the next URL, or 2) accessing the first bit of data from that URL, it seems possible.
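A cheap way to test the captcha theory: when a listing hangs on his machine, dump browser.page_source and scan it for common captcha markers. A naive heuristic sketch (the marker strings below are assumptions, not an exhaustive list, and a clean result does not prove there's no captcha):

```python
def looks_like_captcha(page_source):
    """Naive check for captcha markers in raw HTML.

    The marker list is a guess covering common providers.
    """
    markers = ("captcha", "recaptcha", "hcaptcha", "are you a robot")
    lowered = page_source.lower()
    return any(marker in lowered for marker in markers)
```

Calling `looks_like_captcha(browser.page_source)` right after a hang would at least confirm or rule this theory out.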

Any and all help is much appreciated!

  • Few more things to check/rule out: when it's stuck, is that XPath available on the page (i.e. can you see it in dev tools)? Do you both have the same browser version? Do you set an implicit wait or any large sync times in general? Do you both use the same resolution? (Some sites are responsive, and different window sizes can impact XPaths.) – RichEdwards Jul 14 '20 at 12:17

1 Answer


Ideally, to retrieve the innerText you should induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    item = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.outer.class > div.inner.class"))).text
    
  • Using XPATH:

    item = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='outer class']/div[@class='inner class']"))).text
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
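Putting the two pieces together: a minimal sketch that combines the WebDriverWait call above with the question's empty-string fallback, so a missing or slow element can never hang the scrape indefinitely (get_item_text is a hypothetical helper name; the 20-second timeout and the CSS selector follow the snippets above):

```python
def get_item_text(driver, timeout=20):
    """Wait up to `timeout` seconds for the listing text, else return "".

    Mirrors the question's NoSuchElementException fallback, but with a
    bounded explicit wait instead of an unbounded lookup.
    """
    # Imports kept local so this helper can be pasted next to existing code
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        return WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "div.outer.class > div.inner.class")
            )
        ).text
    except TimeoutException:
        # Same effect as the question's NoSuchElementException branch
        return ""
```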

