
I'm trying to scrape the titles of different posts from their inner pages on a website using Scrapy in combination with Selenium, even though the content of this site is static. The script grabs the links of different posts from the landing page and then reuses those newly parsed links to fetch the titles from their inner pages.

I know there is a library meant for using Selenium within Scrapy. However, I'm not interested in using that library for this basic use case.

There are two methods within the following spider. I could have stuck with one method to do the whole thing, but I used two methods here to understand how I can pass the links from one method to another and do the rest of the work in the latter method without making any requests, as with scrapy.Request().

The script appears to be working correctly. It also works if I drop the yield keyword, i.e. replace yield self.parse_from_innerpage(response.urljoin(elem)) with self.parse_from_innerpage(response.urljoin(elem)).

Question: Should I use yield or not within the parse method to go on with the current implementation, given that both appear to work interchangeably?

I've tried with:

import scrapy
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 15)

    def parse(self, response):
        for elem in response.css(".summary .question-hyperlink::attr(href)").getall():
            yield self.parse_from_innerpage(response.urljoin(elem))

    def parse_from_innerpage(self, link):
        self.driver.get(link)
        elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a")))
        print(elem.text)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT':'Mozilla/5.0',
        'LOG_LEVEL':'ERROR',
    })
    c.crawl(StackoverflowSpider)
    c.start()
    Your `parse_from_innerpage` method doesn't return anything (meaning it returns None). This effectively turns parse into a generator that yields many None objects. Instead of using print, return elem.text. – Chaos Monkey Dec 27 '20 at 10:13

2 Answers

def parse_from_innerpage(self, link):
    self.driver.get(link)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a")))
    print(elem.text)

Before trying to answer your question, I want to add a slight note about the print() statement inside your parse_from_innerpage(self, link) method. Though this topic is controversially discussed, it is better practice to return the object in question - the caller can later decide what to do with it, for example print it after creating a class instance and calling the method on it.
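
For illustration, a minimal generic sketch (the class and method names are made up for this example, they are not part of your spider):

class Scraper:
    def get_title(self):
        # Returning lets the caller decide what to do with the value
        return "some title"

s = Scraper()
print(s.get_title())  # print at the call site, or store/process it instead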

That out of the way - let's tackle your question:

Should I use yield or not within the parse method to go on with the current implementation, given that both appear to work interchangeably?

To answer this - or rather, to understand the reply - I highly suggest looking at some good resources on the concept of yield; the canonical StackOverflow question "What does the yield keyword do?" is a good starting point.

Basically, return sends a specified value back to the caller, whereas yield produces a sequence of values - hence it is perfect for iterating over the values you obtain.
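
To see the difference in action, here is a minimal standalone sketch (function names are made up for illustration):

def with_return():
    return 1               # hands back a single value and ends the call

def with_yield():
    yield 1                # produces a value, then suspends
    yield 2                # resumes here on the next iteration

print(with_return())       # 1
print(with_yield())        # <generator object ...> - nothing has run yet
print(list(with_yield()))  # [1, 2] - iterating drives the generator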


Let's look at your code:

parse(self, response) method:

def parse(self, response):
    for elem in response.css(".summary .question-hyperlink::attr(href)").getall():
        yield self.parse_from_innerpage(response.urljoin(elem))

Simply put, this method expects some values from parse_from_innerpage(self, link), as it "yields" - i.e. iterates over - the received results. However, your parse_from_innerpage(self, link) doesn't return anything - take a look, there is no return statement!:

def parse_from_innerpage(self, link):
    self.driver.get(link)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a")))

    # Use return instead of print()
    print(elem.text)

As parse_from_innerpage(**args) returns None, parse(**args) won't yield anything useful either, as there is nothing to return or iterate over. Hence you should replace print() with return.
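
Applied to your method, the fix could look like this (same selector and wait as in your code, only the last line changes):

def parse_from_innerpage(self, link):
    self.driver.get(link)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a")))
    # Hand the title back to the caller instead of printing it
    return elem.text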


I suggest looking at the Scrapy documentation (especially Scrapy at a glance) to understand how Scrapy works and what it expects a scrapy.Spider to do in order to achieve your goals. Basically, the parse(**args) method is used as a generator (referring to the already mentioned StackOverflow question) for the results you want to obtain - once there are no more elements to iterate over, it stops and "shows them to you":

In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback

... but as mentioned, in your case parse(**args) unfortunately receives None due to print() instead of return.
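
For reference, the pattern that quote describes looks roughly like this (adapted from the Scrapy at a glance example; the site and selectors are from that tutorial, not from your spider):

def parse(self, response):
    for quote in response.css("div.quote"):
        # Each yielded dict becomes one scraped item
        yield {
            "author": quote.xpath("span/small/text()").get(),
            "text": quote.css("span.text::text").get(),
        }

    # Look for a link to the next page and schedule another request
    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        yield response.follow(next_page, self.parse)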

J. M. Arnold

yield in Python is used from inside a generator to suspend execution and hand a value back to the caller.

So here the parse method yields the result of parse_from_innerpage. However, parse_from_innerpage does not have a return statement, which means it returns None.

Read the section of the Scrapy documentation about what Scrapy expects a spider to do.

In short, Scrapy uses the parse method as a generator for the results and then shows them to you when it stops running (runs out of links to scrape). Replace the print with return and everything should work as expected.
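
Putting both fixes together, a corrected version of your spider could look like the sketch below. Note that wrapping the title in a dict is an assumption on my part: Scrapy expects its callbacks to yield requests or item-like objects (e.g. dicts), not bare strings.

import scrapy
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 15)

    def parse(self, response):
        for elem in response.css(".summary .question-hyperlink::attr(href)").getall():
            # Wrap the returned title in a dict so Scrapy treats it as an item
            yield {"title": self.parse_from_innerpage(response.urljoin(elem))}

    def parse_from_innerpage(self, link):
        self.driver.get(link)
        elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a")))
        return elem.text  # return instead of print


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(StackoverflowSpider)
    c.start()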

OranShuster