
I am using Python and Scrapy for this question.

I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image.

So, using Scrapy, the idea in pseudo-code is:

links = getlinks(A)
for link in links:
    B = getpage(link)
    C = getpage(B)
    image = getimage(C)

However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code:

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('...')

    items = []
    for link in links:
        item = CustomItem()
        item['name'] = link.select('...')
        # TODO: Somehow I need to go two pages deep and extract an image.
        item['image'] = ....

How would I go about doing this?

(Note: My question is similar to "Using multiple spiders at in the project in Scrapy", but I am unsure how to "return" values from Scrapy's Request objects.)


1 Answer


In Scrapy, the `parse` method needs to return new `Request` objects if you want to issue more requests (use `yield`, since Scrapy works well with generators). Inside each request you can set a callback to the desired function (to be recursive, just pass `parse` again). That's the way to crawl into further pages.

You can check this recursive crawler as an example.

Following your example, the change would be something like this:

from scrapy.http import Request

def parse(self, response):
    # page A: queue a request for every B page it links to
    b_pages_links = getlinks(A)  # i.e. extract the B-page links from this response
    for link in b_pages_links:
        yield Request(link, callback=self.visit_b_page)

def visit_b_page(self, response):
    url_of_c_page = ...  # extract the link to the C page from the B page
    yield Request(url_of_c_page, callback=self.visit_c_page)

def visit_c_page(self, response):
    url_of_image = ...  # extract the image URL from the C page
    yield Request(url_of_image, callback=self.get_image)

def get_image(self, response):
    item = CustomItem()
    item['name'] = ...   # get image name
    item['image'] = ...  # get image data
    yield item

Also check the Scrapy documentation and these random code snippets. They can help a lot :)

  • Thanks, this is awesome. If I wanted to create the `CustomItem` in the `parse` method, would I pass it using the meta property? – sdasdadas Jun 10 '13 at 01:47
  • Also, I want to return a list of items (`items = []`). How would I use the above and then, upon its completion, append the item to the list? – sdasdadas Jun 10 '13 at 01:49
  • The spider's only function is to visit a page and extract and return the data (the final `yield item`). To aggregate data, like putting all the items in a list, you need to create a function in the `pipelines` module (this is a convention only; a minimal sketch follows these comments). [This example pipeline](https://github.com/bcap/wikipedia-music/blob/master/crawler/crawler/pipelines.py) creates a dot file based on all the music genres that were crawled – Bruno Penteado Jun 10 '13 at 01:54
  • Also remember to declare your pipeline function in the `settings.py`, like in [here](https://github.com/bcap/wikipedia-music/blob/master/crawler/crawler/settings.py) – Bruno Penteado Jun 10 '13 at 01:58
  • And to solve the problem of creating the item in `parse` and completing it in other functions, you can indeed use meta (see the sketch after these comments). This is documented [here](http://doc.scrapy.org/en/0.16/topics/request-response.html#passing-additional-data-to-callback-functions) – Bruno Penteado Jun 10 '13 at 02:01
  • You don't need to append the requests to a list. If there is more than one Item to return, just yield each of them in a for loop. – Capi Etheriel Jun 10 '13 at 23:24
  • @BrunoPolaco, what if the A page and the B pages have some items I want to scrape too? How do I set the callback in that case? – user3768495 Jun 18 '16 at 17:15
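
Following up on the comments above about `meta`: below is a minimal sketch of carrying a half-built item from `parse` down to the callback that finally fetches the image, using the pre-1.0 Scrapy API that appears in the question (`BaseSpider`, `HtmlXPathSelector`). The spider name, URLs and XPath expressions are placeholders, not taken from any real page.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item, Field

class CustomItem(Item):
    # mirrors the CustomItem used in the question
    name = Field()
    image = Field()

class ImageSpider(BaseSpider):
    name = 'images'                                 # placeholder spider name
    start_urls = ['http://example.com/page-a']      # page A (placeholder URL)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for link in hxs.select('//a[@class="b-link"]'):     # B-page links (placeholder XPath)
            item = CustomItem()
            item['name'] = link.select('text()').extract()  # data already available on page A
            url = link.select('@href').extract()[0]         # assumes absolute hrefs
            # carry the half-built item along with the request
            yield Request(url, callback=self.visit_b_page, meta={'item': item})

    def visit_b_page(self, response):
        hxs = HtmlXPathSelector(response)
        c_url = hxs.select('//a[@id="c-link"]/@href').extract()[0]  # placeholder XPath
        # pass the same item on to the next callback
        yield Request(c_url, callback=self.visit_c_page, meta={'item': response.meta['item']})

    def visit_c_page(self, response):
        hxs = HtmlXPathSelector(response)
        image_url = hxs.select('//img/@src').extract()[0]   # placeholder XPath
        yield Request(image_url, callback=self.get_image, meta={'item': response.meta['item']})

    def get_image(self, response):
        item = response.meta['item']
        item['image'] = response.body    # raw image bytes, as in the answer above
        yield item

If the extracted hrefs are relative, join them with `urlparse.urljoin(response.url, href)` before building each `Request`.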
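
And a minimal sketch of the kind of item-collecting pipeline the comments describe, together with the matching `settings.py` entry. The class and project names are illustrative, not taken from the linked example.

# pipelines.py
class CollectItemsPipeline(object):

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # called once for every item a spider yields
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # runs when the crawl finishes; aggregate the collected items here
        spider.log('Collected %d items' % len(self.items))

# settings.py
ITEM_PIPELINES = ['myproject.pipelines.CollectItemsPipeline']

Note that newer Scrapy versions expect `ITEM_PIPELINES` to be a dict mapping the class path to an order number rather than a plain list.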