
I've written a script using Scrapy in combination with Selenium to parse the names of the CEOs of different companies from a webpage. You can find the names of the different companies on the landing page, but you can only get each CEO's name after clicking on the company's link.

The following script can parse the links of the different companies and use those links to scrape the names of the CEOs, except for the second company. When the script tries to parse the name of the CEO using the second company's link, it encounters a stale element reference error. The script fetches the rest of the results correctly even after encountering that error along the way. Once again: it only throws the error while parsing the information from the second company's link. How weird!

The webpage link: http://fortune.com/fortune500/list/

This is what I've tried so far:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):
            company_link = item.find_element_by_css_selector('a[class*="searchResults__cellWrapper--"]').get_attribute("href")
            yield scrapy.Request(company_link,callback=self.get_inner_content)

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

This is the type of results I'm getting:

Jeffrey P. Bezos

raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=76.0.3809.132)

Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Brian S. Tyler
C. Douglas McMillon
David S. Wichmann
Randall L. Stephenson
Steven H. Collis
and so on...

How can I fix the error that my script encounters while dealing with the second company link?

PS: I can use their API to get all the information, but I'm curious to know why the above script runs into this weird trouble.

robots.txt
  • @AndiCover the question you linked got a `StaleReferenceError` due to a copy/paste typo, and should probably be closed. If you see a typo in OP's code, it would be nice of you to point it out, then flag the question as off topic (but not as a duplicate). Marking this as a duplicate does nothing but waste everyone's time. – Lord Elrond Sep 26 '19 at 06:16

4 Answers


A slightly modified approach should get you all the desired content from that site without any issues. All you need to do is store all the target links as a list within the get_links() method and use return or yield when issuing the callbacks to the get_inner_content() method. Since the stored links are plain strings rather than WebElement references, they can't go stale when the driver navigates away. You can also disable images to make the script slightly faster.

The following attempt should get you all the results:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.crawler import CrawlerProcess

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        option = webdriver.ChromeOptions()
        # Disable image loading via Chrome prefs to speed up page loads
        chrome_prefs = {}
        option.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}

        self.driver = webdriver.Chrome(options=option)
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
        return [scrapy.Request(link,callback=self.get_inner_content) for link in item_links]

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FortuneSpider)
    process.start()

Or using yield:

def get_links(self,response):
    self.driver.get(response.url)
    item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
    for link in item_links:
        yield scrapy.Request(link,callback=self.get_inner_content) 
SIM

You are getting a stale element exception because get_inner_content() navigates away from the original page:

    def get_inner_content(self,response):
        self.driver.get(response.url)
        ...

Since you are looping through the elements found in get_links()...

    for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):

Any subsequent access to item will raise a StaleElementReferenceException, because the driver has navigated away from the page that element was attached to.

Your script works for "Walmart" since it's the first item. You get the error on Exxon Mobil because, by the time the loop reaches it, get_inner_content() has already navigated the driver away from the listing page.
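
For illustration, a minimal sketch of the fix (the same idea as the answer above): read every href string while the listing page is still loaded, and only then start yielding requests, so the loop never touches a live WebElement after navigation:

def get_links(self, response):
    self.driver.get(response.url)
    rows = self.wait.until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, '[class*="searchResults__title--"]')))
    # Read the href attributes immediately, while the listing page is still open;
    # unlike the WebElements in `rows`, plain strings cannot become stale.
    links = [row.find_element_by_css_selector(
                 'a[class*="searchResults__cellWrapper--"]').get_attribute("href")
             for row in rows]
    for link in links:
        yield scrapy.Request(link, callback=self.get_inner_content)

Whether you then return a list of requests or yield them one by one makes no difference; what matters is that all the attribute reads finish before the first callback navigates the driver.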

ddavison
  • I would be satisfied if I could see that, as soon as the script throws the above error, it stops working for the rest of the links, but the thing is it throws that error only for the second link. The rest are parsed in the right way. Refer to the result above. – robots.txt Sep 25 '19 at 10:40

To parse the names of the CEOs of different companies from the webpage https://fortune.com/fortune500/search/, Selenium alone would be enough, and you need to:

  • Scroll to the last item on the webpage.
  • Collect the href attributes and store them in a list.
  • Open each href in an adjacent tab.
  • Switch the focus to the newly opened tab and induce WebDriverWait for visibility_of_element_located(); you can use the following locator strategies:

    • Code Block:

      # -*- coding: UTF-8 -*-
      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("https://fortune.com/fortune500/search/")
      driver.execute_script("arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Explore Lists from Other Years']"))))
      my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@class, 'searchResults__cellWrapper--') and contains(@href, 'fortune500')][.//span/div]")))]
      windows_before  = driver.current_window_handle
      for my_href in my_hrefs:
          driver.execute_script("window.open('" + my_href +"');")
          WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
          windows_after = driver.window_handles
          new_window = [x for x in windows_after if x != windows_before][0]
          driver.switch_to.window(new_window)
          print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table/tbody/tr//td[starts-with(@class, 'dataTable__value')]/div"))).text)
          driver.close() # close the window
          driver.switch_to.window(windows_before) # switch_to the parent_window_handle
      driver.quit()
      
    • Console Output:

      C. Douglas McMillon
      Darren W. Woods
      Timothy D. Cook
      Warren E. Buffett
      Jeffrey P. Bezos
      David S. Wichmann
      Brian S. Tyler
      Larry J. Merlo
      Randall L. Stephenson
      Steven H. Collis
      Michael K. Wirth
      James P. Hackett
      Mary T. Barra
      W. Craig Jelinek
      Larry Page
      Michael C. Kaufmann
      Stefano Pessina
      James Dimon
      Hans E. Vestberg
      W. Rodney McMullen
      H. Lawrence Culp Jr.
      Hugh R. Frater
      Greg C. Garland
      Joseph W. Gorder
      Brian T. Moynihan
      Satya Nadella
      Craig A. Menear
      Dennis A. Muilenburg
      C. Allen Parker
      Michael L. Corbat
      Gary R. Heminger
      Brian L. Roberts
      Gail K. Boudreaux
      Michael S. Dell
      Marc Doyle
      Michael L. Tipsord
      Alex Gorsky
      Virginia M. Rometty
      Brian C. Cornell
      Donald H. Layton
      David P. Abney
      Marvin R. Ellison
      Robert H. Swan
      Michel A. Khalaf
      David S. Taylor
      Gregory J. Hayes
      Frederick W. Smith
      Ramon L. Laguarta
      Juan R. Luciano
      .
      .
      .
      
undetected Selenium

Here is how you can get the company details without using Selenium, much faster and lighter.
See how company_name and change_the_world are obtained; you can extract the other details the same way.

import requests
from bs4 import BeautifulSoup
import re
import html

with requests.Session() as session:
    # This JSON endpoint backs the search page and returns the whole list in one call
    response = session.get("https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932")
    items = response.json()[1]["items"]
    for item in items:
        # Each item carries a list of {key, value} fields; pick fields out by key
        company_name = html.unescape(list(filter(lambda x: x['key'] == 'name', item["fields"]))[0]["value"])
        change_the_world = list(filter(lambda x: x['key'] == 'change-the-world-y-n', item["fields"]))[0]["value"]

        # The CEO name only appears on the company page, embedded in its preloaded JSON
        response = session.get(item["permalink"])
        preload_data = BeautifulSoup(response.text, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload_data).groups()[0]

        print(f"Company: {company_name}, CEO: {ceo}, Change The World: {change_the_world}")

Result:

Company: Carvana, CEO: Ernest C. Garcia, Change The World: no
Company: ManTech International, CEO: Kevin M. Phillips, Change The World: no
Company: NuStar Energy, CEO: Bradley C. Barron, Change The World: no
Company: Shutterfly, CEO: Ryan O’Hara, Change The World: no
Company: Spire, CEO: Suzanne Sitherwood, Change The World: no
Company: Align Technology, CEO: Joseph M. Hogan, Change The World: no
Company: Herc Holdings, CEO: Lawrence H. Silber, Change The World: no
...

Sers