I've written a script using scrapy in combination with selenium to parse the names of the CEOs of different companies from a webpage. The company names are listed on the landing page, but the name of each CEO is only available after you click through to that company's link.
The following script can parse the links of the different companies and use those links to scrape the CEO names, except for the second company. When the script tries to parse the CEO name using the link of the second company, it hits a stale element reference error. The script fetches the rest of the results correctly even though it encountered that error along the way. Once again: it only throws the error when parsing the information using the second company's link. How weird!
This is what I've tried so far:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FortuneSpider(scrapy.Spider):
    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):
            company_link = item.find_element_by_css_selector('a[class*="searchResults__cellWrapper--"]').get_attribute("href")
            yield scrapy.Request(company_link,callback=self.get_inner_content)

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}
This is the kind of output I'm getting:
Jeffrey P. Bezos
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=76.0.3809.132)
Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Brian S. Tyler
C. Douglas McMillon
David S. Wichmann
Randall L. Stephenson
Steven H. Collis
and so on.
How can I fix the error that my script encounters when dealing with the second company link?
P.S. I could use their API to get all of this information, but I'm curious to know why the script above runs into this odd problem.
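
One possible explanation, offered as an assumption rather than a confirmed diagnosis: get_links is a generator, so each href is read only when the next request is pulled from it, and by then get_inner_content may already have pointed the same driver at another page. The sketch below reproduces that interleaving without a browser. FakePage, FakeElement, StaleError, lazy_links, eager_links and crawl are all hypothetical stand-ins invented for this illustration; none of them are selenium or scrapy names.

```python
class StaleError(Exception):
    """Hypothetical stand-in for selenium's StaleElementReferenceException."""


class FakePage:
    """Hypothetical stand-in for the single shared browser tab."""

    def __init__(self):
        self.generation = 0

    def navigate(self):
        # Loading a new document invalidates every element found earlier.
        self.generation += 1


class FakeElement:
    """Hypothetical element that remembers which document it came from."""

    def __init__(self, page, href):
        self.page = page
        self.href = href
        self.generation = page.generation

    def get_attribute(self, name):
        if self.generation != self.page.generation:
            raise StaleError("element is not attached to the page document")
        return self.href


def lazy_links(elements):
    # Mirrors the loop in get_links: each href is read only on demand.
    for element in elements:
        yield element.get_attribute("href")


def eager_links(elements):
    # Reads every href up front, before anything can navigate away.
    return [element.get_attribute("href") for element in elements]


def crawl(page, links):
    # Mirrors get_inner_content: every link moves the shared page elsewhere.
    visited = []
    for link in links:
        page.navigate()
        visited.append(link)
    return visited


def make_elements(page, count=3):
    return [FakeElement(page, "https://example.com/%d" % i) for i in range(count)]


if __name__ == "__main__":
    page = FakePage()
    print(crawl(page, eager_links(make_elements(page))))

    page = FakePage()
    try:
        crawl(page, lazy_links(make_elements(page)))
    except StaleError as error:
        print("lazy version failed:", error)
```

In this simulation the lazy version fails on the second link, because the first link has already moved the page on. If the same thing is happening in the spider, building the full list of href strings in get_links before the first yield would be the matching change. The real scheduling inside scrapy is more involved than this loop, so treat it as a way to test the hypothesis and not as a verified fix.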