
My spider function is on a page, and I need to follow a link and get some data from that page to add to my item. I also need to visit several other pages from the parent page without creating more items. How would I go about doing that? From what I can read in the documentation, I can only go in a linear fashion:

  parent page > next page > next page

But I need to:

  parent page > next page
              > next page
              > next page

2 Answers


You should return Request instances and pass the item around in meta. You would have to do it in a linear fashion, building a chain of requests and callbacks. To achieve this, you can pass along a list of the requests still needed to complete the item and return the item from the last callback:

# Methods of your Spider subclass (assumes `import scrapy` and a defined MyItem)
def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url

    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    request1 = scrapy.Request(url1, callback=self.parse_page1)

    url2 = response.xpath('//a[@class="link2"]/@href').extract()[0]
    request2 = scrapy.Request(url2, callback=self.parse_page2)

    url3 = response.xpath('//a[@class="link3"]/@href').extract()[0]
    request3 = scrapy.Request(url3, callback=self.parse_page3)

    request1.meta['item'] = item
    request1.meta['requests'] = [request2, request3]
    return request1

def parse_page1(self, response):
    item = response.meta['item']
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]

    # Re-attach the item and the remaining requests before returning the next one
    request = response.meta['requests'].pop(0)
    request.meta['item'] = item
    request.meta['requests'] = response.meta['requests']
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['data2'] = response.xpath('//div[@class="data2"]/text()').extract()[0]

    # Re-attach the item and the remaining requests before returning the next one
    request = response.meta['requests'].pop(0)
    request.meta['item'] = item
    request.meta['requests'] = response.meta['requests']
    return request

def parse_page3(self, response):
    item = response.meta['item']
    item['data3'] = response.xpath('//div[@class="data3"]/text()').extract()[0]

    return item
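
Note that request2 and request3 are created before the item is filled in, so each intermediate callback re-attaches the current item and the list of remaining requests to the meta of the request it returns; only the final callback, with nothing left to fetch, returns the finished item.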

  • Thank you, this structure will have to work. *cracks knuckles* I'm guessing it has to do with the Reactor that scrapy uses? – William Young Nov 16 '14 at 22:59
  • @Barfe yeah, this is due to the asynchronous nature of scrapy. You cannot predict which request will be completed before another one, or, in other words, you cannot know from which callback to return an item. – alecxe Nov 16 '14 at 23:02
  • @Barfe also, check out [this answer](http://stackoverflow.com/a/25571270/771848) - it is exactly about your use case. It is basically the same idea as I've proposed, but in a more elegant fashion. – alecxe Nov 16 '14 at 23:07

Using Scrapy Requests, you can perform extra operations on the next URL inside the scrapy.Request's callback.
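
For instance, a minimal sketch of that pattern (the spider name, URLs, selectors, and field names here are hypothetical, not taken from the original answer):

import scrapy

class DetailSpider(scrapy.Spider):
    # Hypothetical example spider; names, URLs and selectors are illustrative
    name = 'detail'
    start_urls = ['http://example.com/']

    def parse(self, response):
        item = {'main_url': response.url}
        detail_url = response.xpath('//a[@class="detail"]/@href').extract()[0]
        # Follow the link; the extra operations happen in the callback
        request = scrapy.Request(response.urljoin(detail_url),
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

    def parse_detail(self, response):
        # Extra operations on the followed URL are performed here
        item = response.meta['item']
        item['detail'] = response.xpath('//h1/text()').extract()[0]
        yield item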