
I've written a script using Scrapy in combination with Selenium to parse the names of the CEOs of different companies from a webpage. You can find the names of the different companies on the landing page, but you can only get each CEO's name after clicking on the company's link.

The following script can parse the links of the different companies and use those links to scrape the names of the CEOs, except for the second company. When the script tries to parse the name of the CEO using the second company's link, it encounters a stale element reference error. The script fetches the rest of the results correctly even after encountering that error along the way. Once again: it only throws the error while parsing the information from the second company's link. How weird!

The webpage link: http://fortune.com/fortune500/list/

This is what I've tried so far:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):
            company_link = item.find_element_by_css_selector('a[class*="searchResults__cellWrapper--"]').get_attribute("href")
            yield scrapy.Request(company_link,callback=self.get_inner_content)

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

This is the type of results I'm getting:

Jeffrey P. Bezos

raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=76.0.3809.132)

Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Brian S. Tyler
C. Douglas McMillon
David S. Wichmann
Randall L. Stephenson
Steven H. Collis
and so on...

How can I fix the error that my script encounters while dealing with the second company link?

PS: I can use their API to get all the information, but I'm curious to know why the above script runs into this weird trouble.

robots.txt
  • @AndiCover the question you linked got a `StaleReferenceError` due to a copy/paste typo, and should probably be closed. If you see a typo in OP's code, it would be nice of you to point it out, then flag the question as off topic (but not as a duplicate). Marking this as a duplicate does nothing but waste everyone's time. – Lord Elrond Sep 26 '19 at 06:16

4 Answers


A slightly modified approach should get you all the desired content from that site without any issues. All you need to do is store all the target links as a list within the get_links() method and use return or yield when issuing the callbacks to the get_inner_content() method. Since the stored links are plain strings rather than WebElement references, they can't go stale when the driver navigates away. You can also disable images to make the script slightly faster.

The following attempt should get you all the results:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.crawler import CrawlerProcess

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        option = webdriver.ChromeOptions()
        # Disable image loading via Chrome prefs to speed up page loads
        chrome_prefs = {}
        option.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}

        self.driver = webdriver.Chrome(options=option)
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
        return [scrapy.Request(link,callback=self.get_inner_content) for link in item_links]

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FortuneSpider)
    process.start()

Or using yield:

def get_links(self,response):
    self.driver.get(response.url)
    item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
    for link in item_links:
        yield scrapy.Request(link,callback=self.get_inner_content) 
SIM

You are getting a stale element exception because get_inner_content() navigates away from the original page:

    def get_inner_content(self,response):
        self.driver.get(response.url)
        ...

Since you are looping through the elements found in get_links()...

    for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):

Any subsequent access to item will raise a StaleElementReferenceException, because the driver has navigated away from the page that element was attached to.

Your script works for "Walmart" since it's the first item. You get the error on Exxon Mobil because, by the time the loop reaches it, get_inner_content() has already navigated the driver away from the listing page.
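
For illustration, a minimal sketch of the fix (the same idea as the answer above): read every href string while the listing page is still loaded, and only then start yielding requests, so the loop never touches a live WebElement after navigation:

def get_links(self, response):
    self.driver.get(response.url)
    rows = self.wait.until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, '[class*="searchResults__title--"]')))
    # Read the href attributes immediately, while the listing page is still open;
    # unlike the WebElements in `rows`, plain strings cannot become stale.
    links = [row.find_element_by_css_selector(
                 'a[class*="searchResults__cellWrapper--"]').get_attribute("href")
             for row in rows]
    for link in links:
        yield scrapy.Request(link, callback=self.get_inner_content)

Whether you then return a list of requests or yield them one by one makes no difference; what matters is that all the attribute reads finish before the first callback navigates the driver.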

ddavison
  • I would be satisfied if I could see that, as soon as the script throws the above error, it stops working for the rest of the links, but the thing is it throws that error only for the second link. The rest are parsed in the right way. Refer to the result above. – robots.txt Sep 25 '19 at 10:40

To parse the names of the CEOs of different companies from the webpage https://fortune.com/fortune500/search/, Selenium alone would be enough, and you need to:

  • Scroll to the last item on the webpage.
  • Collect the href attributes and store them in a list.
  • Open each href in an adjacent tab.
  • Switch the focus to the newly opened tab and induce WebDriverWait for visibility_of_element_located(); you can use the following locator strategies:

    • Code Block:

      # -*- coding: UTF-8 -*-
      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("https://fortune.com/fortune500/search/")
      driver.execute_script("arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Explore Lists from Other Years']"))))
      my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@class, 'searchResults__cellWrapper--') and contains(@href, 'fortune500')][.//span/div]")))]
      windows_before  = driver.current_window_handle
      for my_href in my_hrefs:
          driver.execute_script("window.open('" + my_href +"');")
          WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
          windows_after = driver.window_handles
          new_window = [x for x in windows_after if x != windows_before][0]
          driver.switch_to.window(new_window)
          print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table/tbody/tr//td[starts-with(@class, 'dataTable__value')]/div"))).text)
          driver.close() # close the window
          driver.switch_to.window(windows_before) # switch_to the parent_window_handle
      driver.quit()
      
    • Console Output:

      C. Douglas McMillon
      Darren W. Woods
      Timothy D. Cook
      Warren E. Buffett
      Jeffrey P. Bezos
      David S. Wichmann
      Brian S. Tyler
      Larry J. Merlo
      Randall L. Stephenson
      Steven H. Collis
      Michael K. Wirth
      James P. Hackett
      Mary T. Barra
      W. Craig Jelinek
      Larry Page
      Michael C. Kaufmann
      Stefano Pessina
      James Dimon
      Hans E. Vestberg
      W. Rodney McMullen
      H. Lawrence Culp Jr.
      Hugh R. Frater
      Greg C. Garland
      Joseph W. Gorder
      Brian T. Moynihan
      Satya Nadella
      Craig A. Menear
      Dennis A. Muilenburg
      C. Allen Parker
      Michael L. Corbat
      Gary R. Heminger
      Brian L. Roberts
      Gail K. Boudreaux
      Michael S. Dell
      Marc Doyle
      Michael L. Tipsord
      Alex Gorsky
      Virginia M. Rometty
      Brian C. Cornell
      Donald H. Layton
      David P. Abney
      Marvin R. Ellison
      Robert H. Swan
      Michel A. Khalaf
      David S. Taylor
      Gregory J. Hayes
      Frederick W. Smith
      Ramon L. Laguarta
      Juan R. Luciano
      .
      .
      .
      
undetected Selenium

Here is how you can get the company details without using Selenium, much faster and lighter.
See how company_name and change_the_world are obtained; you can extract the other details the same way.

import requests
from bs4 import BeautifulSoup
import re
import html

with requests.Session() as session:
    # This JSON endpoint backs the search page and returns the whole list in one call
    response = session.get("https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932")
    items = response.json()[1]["items"]
    for item in items:
        # Each item carries a list of {key, value} fields; pick fields out by key
        company_name = html.unescape(list(filter(lambda x: x['key'] == 'name', item["fields"]))[0]["value"])
        change_the_world = list(filter(lambda x: x['key'] == 'change-the-world-y-n', item["fields"]))[0]["value"]

        # The CEO name only appears on the company page, embedded in its preloaded JSON
        response = session.get(item["permalink"])
        preload_data = BeautifulSoup(response.text, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload_data).groups()[0]

        print(f"Company: {company_name}, CEO: {ceo}, Change The World: {change_the_world}")

Result:

Company: Carvana, CEO: Ernest C. Garcia, Change The World: no
Company: ManTech International, CEO: Kevin M. Phillips, Change The World: no
Company: NuStar Energy, CEO: Bradley C. Barron, Change The World: no
Company: Shutterfly, CEO: Ryan O’Hara, Change The World: no
Company: Spire, CEO: Suzanne Sitherwood, Change The World: no
Company: Align Technology, CEO: Joseph M. Hogan, Change The World: no
Company: Herc Holdings, CEO: Lawrence H. Silber, Change The World: no
...

Sers