
Just wondering why, when I have url = ['site1', 'site2'] and I run Scrapy from a script calling .crawl() twice in a row, like

def run_spiders():
    process.crawl(Spider)
    process.crawl(Spider)

the output is:

site1info
site1info
site2info
site2info 

as opposed to

site1info
site2info
site1info
site2info
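
For reference, a runnable version of the setup being described could look roughly like this (the spider body, URLs, and CrawlerProcess wiring are assumptions made to keep the sketch self-contained, not the asker's actual code):

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider(scrapy.Spider):
    name = "spider"
    # placeholder URLs standing in for 'site1' and 'site2'
    start_urls = ["http://site1", "http://site2"]

    def parse(self, response):
        # stands in for whatever produces "site1info" / "site2info"
        print(f"{response.url}info")

process = CrawlerProcess()

def run_spiders():
    process.crawl(Spider)
    process.crawl(Spider)

run_spiders()
process.start()
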
Five9

2 Answers


start_requests uses the yield functionality: yield queues the requests. To understand it fully, read this Stack Overflow answer.

Here is a code example of how it works with start_urls in the start_requests method.

start_urls = [
    "url1.com",
    "url2.com",
]

def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse)

For custom request ordering, the priority argument can be used.

def start_requests(self):
    yield scrapy.Request(self.start_urls[0], callback=self.parse)
    yield scrapy.Request(self.start_urls[1], callback=self.parse, priority=1)

The request with the higher priority value will be yielded from the queue first. By default, priority is 0.
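
Putting both snippets into a complete spider, a minimal sketch could look like this (the spider name and URLs are placeholders):

import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered"
    start_urls = ["http://url1.com", "http://url2.com"]

    def start_requests(self):
        # the request with the higher priority leaves the scheduler first
        yield scrapy.Request(self.start_urls[0], callback=self.parse)
        yield scrapy.Request(self.start_urls[1], callback=self.parse, priority=1)

    def parse(self, response):
        self.logger.info("parsed %s", response.url)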

Chunkford

Because as soon as you call process.start(), requests are handled asynchronously. The order is not guaranteed.

In fact, even if you only call process.crawl() once, you may sometimes get:

site2info
site1info

To run spiders sequentially from Python, see this other answer.
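
If it has to stay in Python, one common sketch (adapted from the pattern in the Scrapy docs for running spiders sequentially; Spider stands in for your spider class) chains the crawls with CrawlerRunner:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider)  # the first crawl finishes...
    yield runner.crawl(Spider)  # ...before the second one starts
    reactor.stop()

crawl()
reactor.run()  # blocks here until the last crawl is done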

Gallaecio
  • I see. Is there a good way to repeatedly run several scrapes on a page then? I think my conceptual understanding is wrong; I was going to make ~5 spiders run about 1-2 seconds apart. Could I maybe make one spider "recrawl" every x seconds? New to scrapy... – Five9 Feb 26 '19 at 02:06
  • Does it have to happen in Python? You could, for example, use Bash to run `scrapy crawl ` in a `while` loop, using `sleep` to pause execution for 2 seconds: `while true; do scrapy crawl ; sleep 2; done`. If you want to do it from Python, you will need to find out how to run spiders in sequence from the script (look for questions about not being able to restart the Twisted reactor, a common issue when following that approach; a Python sketch of this loop appears after these comments). – Gallaecio Feb 26 '19 at 10:25
  • I'm not familiar with bash; I assume it has to happen in Python because I want to do a few calculations with the pulled data and then use Selenium to perform some web actions. – Five9 Feb 26 '19 at 16:12
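
A Python approximation of the Bash loop suggested above, which sidesteps the reactor-restart problem by launching each crawl in its own process (the spider name "myspider" is a placeholder), could look like this:

import subprocess
import time

while True:
    # each crawl runs in a fresh process, so the Twisted reactor is never restarted
    subprocess.run(["scrapy", "crawl", "myspider"], check=True)
    # ...do the calculations / Selenium steps on the pulled data here...
    time.sleep(2)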