Now I can see that Scrapy downloads all pages concurrently, but what I need is to chain the `people` and `extract_person` methods: when I get a list of person URLs in `people`, I want to follow all of them and scrape all the info I need, and only after that continue with the next page of person URLs. How can I do that?

from scrapy.http import Request
from scrapy.selector import Selector

def people(self, response):
    sel = Selector(response)
    urls = sel.xpath(XPATHS.URLS).extract()
    for url in urls:
        # follow every person link found on the listing page
        yield Request(
            url=BASE_URL + url,
            callback=self.extract_person,
        )

def extract_person(self, response):
    sel = Selector(response)
    name = sel.xpath(XPATHS.NAME).extract()[0]
    person = PersonItem(name=name)
    yield person
Dmitrii Mikhailov
1 Answer


You can control the priority of the requests:

priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

Setting the priority of the person requests to 1 tells Scrapy to process them first:

for url in urls:
    yield Request(
        url=BASE_URL + url,
        callback=self.extract_person,
        priority=1,
    )
alecxe
  • As I can see in the logs, it's still pretty much the same: there are still a lot of `people` method executions before `extract_person`. – Dmitrii Mikhailov Nov 06 '14 at 16:49
  • @DmitryMikhaylov yeah, they are probably already in the queue due to how `start_urls` are handled internally. Give [this solution](http://stackoverflow.com/a/9176662/771848) a try - override the `start_requests()` method and return a list of requests from it (see the sketch after these comments). Thanks. Let me know if it helps. – alecxe Nov 06 '14 at 16:53
  • Yes, but doesn't the priority only affect how requests are ordered within the queue? I also could not get the `priority` keyword to work properly in my crawlers. – Evren Yurtesen Apr 17 '19 at 16:12
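
For reference, a minimal sketch of the `start_requests()` approach from the linked answer, combined with the priority trick above. `FIRST_PAGE`, `XPATHS.NEXT_PAGE`, and the spider name are hypothetical names introduced for illustration; `BASE_URL`, `XPATHS`, and `PersonItem` are taken from the question:

from scrapy import Spider
from scrapy.http import Request
from scrapy.selector import Selector

class PeopleSpider(Spider):
    name = "people"

    def start_requests(self):
        # Queue only the first listing page up front; later pages are
        # yielded from the callback, so the scheduler is never flooded
        # with listing requests ahead of the person requests.
        # FIRST_PAGE is an assumed constant, not from the question.
        yield Request(url=BASE_URL + FIRST_PAGE, callback=self.people)

    def people(self, response):
        sel = Selector(response)
        for url in sel.xpath(XPATHS.URLS).extract():
            # priority=1 makes the scheduler prefer person pages over
            # listing pages, which keep the default priority of 0
            yield Request(
                url=BASE_URL + url,
                callback=self.extract_person,
                priority=1,
            )
        # only then move on to the next listing page
        # (XPATHS.NEXT_PAGE is an assumed selector, not from the question)
        next_pages = sel.xpath(XPATHS.NEXT_PAGE).extract()
        if next_pages:
            yield Request(url=BASE_URL + next_pages[0], callback=self.people)

    def extract_person(self, response):
        sel = Selector(response)
        name = sel.xpath(XPATHS.NAME).extract()[0]
        yield PersonItem(name=name)

Note that even with this setup, Scrapy still downloads person pages concurrently: priority only orders the scheduler queue, so the next listing page is requested after the queued person requests, not after their responses have actually arrived.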