
I am trying this sample code:

from scrapy.spiders import Spider, Request  
import scrapy

class MySpider(Spider):

    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']

    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        for url in self.urls:
            return Request(url)

It crawls all the pages fine. However, if I yield an item before the for loop, then it crawls only the first page (as shown below):

from scrapy.spiders import Spider, Request  
import scrapy

class MySpider(Spider):

    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']

    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        yield scrapy.item.Item()
        for url in self.urls:
            return Request(url)

But I can use yield Request(url) instead of return, and then it scrapes the pages backwards, from the last page to the first.
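For clarity, the yield variant I am referring to is just the same parse() with the return swapped for yield, roughly sketched here:

def parse(self, response):
    for url in self.urls:
        yield Request(url)  # yielding lets the loop hand out every remaining page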

I would like to understand why return no longer works once an item is yielded. Can somebody explain this in a simple way?

Evren Yurtesen

2 Answers


You ask why the second snippet does not work, but I don’t think you fully understand why the first one works :)

The for loop in your first snippet only loops once.

What is happening is:

  1. self.parse() is called for the URL in self.start_urls.

  2. self.parse() gets the first (and only the first!) URL from self.urls and returns a Request for it, exiting self.parse().

  3. When a response for that first URL arrives, self.parse() gets called again, and this time it returns a request (only 1 request!) for the second URL from self.urls, because the previous call to self.parse() already consumed the first URL from it (self.urls is an iterator).

The last step repeats in a loop, but it is not the for loop that does it.
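To see it without Scrapy, here is a minimal sketch in plain Python (the names are made up) of the same pattern: each call consumes exactly one item from the generator, because return exits on the first iteration of the for loop.

urls = ('page-{}.html'.format(i + 1) for i in range(3))

def parse_like(response):
    for url in urls:
        return url  # exits immediately, consuming exactly one URL per call

print(parse_like('response 1'))  # page-1.html
print(parse_like('response 2'))  # page-2.html
print(parse_like('response 3'))  # page-3.html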

You can change your original code to this and it will work the same way:

def parse(self, response):
    try:
        # next() pulls the next URL from the generator; wrap it in a Request
        # so Scrapy can schedule it. StopIteration means the URLs are exhausted.
        return Request(next(self.urls))
    except StopIteration:
        pass
Gallaecio
  • Your explanation is very detailed but it still does not explain why `yield` stops execution. Shouldn't it return a `Request(url)`, which would in turn reach the for loop again? Why does it stop at the first iteration? As far as I understand, `yield` does raise `StopIteration`, doesn't it? – Evren Yurtesen Apr 30 '19 at 21:20

Because to yield items/requests the callback has to be a generator function. You cannot even use yield and a return with a value in the same function with the same "meaning": in Python 2 that raises SyntaxError: 'return' with argument inside generator, and in Python 3 the returned value is only attached to StopIteration, so Scrapy never sees it.

A bare return inside a generator is (almost) equivalent to raising StopIteration. In the topic Return and yield in the same function you can find a very detailed explanation, with links to the specification.
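Here is a minimal sketch in plain Python (no Scrapy, the names are invented) of what happens once a function contains both yield and a return with a value:

def parse_like():
    yield 'an item'
    return 'a request'  # Python 3 allows this, but the value only rides on StopIteration

print(list(parse_like()))  # prints ['an item']; the returned 'a request' never reaches the caller

Scrapy iterates over the generator that parse() returns, so a Request passed to return is effectively discarded and only the yielded item is processed, which is why your second spider stops after the first page.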

vezunchik
  • It is not clear to me why my problem would be with `return` raising `StopIteration`, as the method works normally unless I use `yield`. So if the problem were due to the `StopIteration` exception, shouldn't the first example crawl only the first URL and stop? – Evren Yurtesen Apr 19 '19 at 17:27