I'm trying to scrape the titles of different posts from their inner pages on a website using Scrapy in combination with Selenium, even though the content of this site is static. The script grabs the links of different posts from the landing page and then reuses those newly parsed links to fetch the titles from the inner pages.
I know there is a library meant for using Selenium within Scrapy. However, I'm not interested in using that library for this basic use case.
There are two methods within the following spider. I could stick with one method to do the whole thing, but I used two methods here to understand how I can pass the links from one method to another and do the rest of the work in the latter method without issuing a new request, as in scrapy.Request().
The script appears to be working correctly. It also works correctly if I drop the yield, i.e. if I change yield self.parse_from_innerpage(response.urljoin(elem)) to a plain call: self.parse_from_innerpage(response.urljoin(elem)).
Question: given the current implementation, should I use yield or not within the parse method, since both variants appear to work interchangeably?
I've tried with:
import scrapy
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 15)

    def parse(self, response):
        # Collect the relative post links from the landing page
        for elem in response.css(".summary .question-hyperlink::attr(href)").getall():
            yield self.parse_from_innerpage(response.urljoin(elem))

    def parse_from_innerpage(self, link):
        # Fetch the inner page with Selenium and print the post title
        self.driver.get(link)
        elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a")))
        print(elem.text)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(StackoverflowSpider)
    c.start()
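To make the distinction concrete, here is a minimal, Scrapy-free sketch of the two call styles. The names handle, parse_with_yield, and parse_without_yield are hypothetical stand-ins for parse_from_innerpage and the two versions of parse; the point is only what each style produces:

```python
def handle(link):
    """Stand-in for parse_from_innerpage: has a side effect, returns None."""
    print(f"visiting {link}")

def parse_with_yield(links):
    # Generator: each handle() call runs only when the generator is consumed,
    # and its None return value is yielded as an output item.
    for link in links:
        yield handle(link)

def parse_without_yield(links):
    # Plain function: handle() runs immediately; the function returns None.
    for link in links:
        handle(link)
```

Since Scrapy exhausts the iterable a callback returns, the side effects happen in both cases; as far as I can tell, the only difference is that the yield variant additionally hands a stream of None values back to Scrapy as "output", while the plain-call variant returns nothing at all.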