Now I can see that Scrapy downloads all pages concurrently, but what I need is to chain the `people` and `extract_person` methods: when I get a list of person URLs in `people`, I want to follow all of them and scrape all the info I need, and only after that continue with the next page of person URLs. How can I do that?

from scrapy.http import Request
from scrapy.selector import Selector

def people(self, response):
    sel = Selector(response)
    urls = sel.xpath(XPATHS.URLS).extract()
    for url in urls:
        # follow every person link found on the listing page
        yield Request(
            url=BASE_URL + url,
            callback=self.extract_person,
        )

def extract_person(self, response):
    sel = Selector(response)
    name = sel.xpath(XPATHS.NAME).extract()[0]
    person = PersonItem(name=name)
    yield person
Dmitrii Mikhailov
1 Answer


You can control the priority of the requests:

priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

Setting the priority of the person requests to 1 tells Scrapy to process them first:

for url in urls:
    yield Request(
        url=BASE_URL + url,
        callback=self.extract_person,
        priority=1,
    )
alecxe
  • As I can see in the logs, it's still pretty much the same: there are still a lot of `people` method executions before `extract_person`. – Dmitrii Mikhailov Nov 06 '14 at 16:49
  • @DmitryMikhaylov yeah, they are probably already in the queue due to how `start_urls` are handled internally. Give [this solution](http://stackoverflow.com/a/9176662/771848) a try - override the `start_requests()` method and return a list of requests from it (see the sketch after these comments). Thanks. Let me know if it helps. – alecxe Nov 06 '14 at 16:53
  • Yes, but doesn't the priority only affect how requests are ordered within the queue? I also could not get the `priority` keyword to work properly in my crawlers. – Evren Yurtesen Apr 17 '19 at 16:12
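
For reference, a minimal sketch of the `start_requests()` approach from the linked answer, combined with the priority trick above. `FIRST_PAGE`, `XPATHS.NEXT_PAGE`, and the spider name are hypothetical names introduced for illustration; `BASE_URL`, `XPATHS`, and `PersonItem` are taken from the question:

from scrapy import Spider
from scrapy.http import Request
from scrapy.selector import Selector

class PeopleSpider(Spider):
    name = "people"

    def start_requests(self):
        # Queue only the first listing page up front; later pages are
        # yielded from the callback, so the scheduler is never flooded
        # with listing requests ahead of the person requests.
        # FIRST_PAGE is an assumed constant, not from the question.
        yield Request(url=BASE_URL + FIRST_PAGE, callback=self.people)

    def people(self, response):
        sel = Selector(response)
        for url in sel.xpath(XPATHS.URLS).extract():
            # priority=1 makes the scheduler prefer person pages over
            # listing pages, which keep the default priority of 0
            yield Request(
                url=BASE_URL + url,
                callback=self.extract_person,
                priority=1,
            )
        # only then move on to the next listing page
        # (XPATHS.NEXT_PAGE is an assumed selector, not from the question)
        next_pages = sel.xpath(XPATHS.NEXT_PAGE).extract()
        if next_pages:
            yield Request(url=BASE_URL + next_pages[0], callback=self.people)

    def extract_person(self, response):
        sel = Selector(response)
        name = sel.xpath(XPATHS.NAME).extract()[0]
        yield PersonItem(name=name)

Note that even with this setup, Scrapy still downloads person pages concurrently: priority only orders the scheduler queue, so the next listing page is requested after the queued person requests, not after their responses have actually arrived.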