
I'm trying to create an ExampleSpider that extends Scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages that contain both artist and album info.

I was able to handle the first two scenarios, but the problem occurs in the third. I'm using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?

  1. Should I do it like below? (Two rules for the same URL pattern.)
  2. Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)
  3. Is there another (proper) way to do it?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    from example.items import ArtistItem, AlbumItem  # assuming the items are defined in items.py

    class ExampleSpider(CrawlSpider):
        name = 'example'

        start_urls = ['http://www.example.com']

        rules = [
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
            # more rules .....
        ]

        def parse_artist(self, response):
            artist_item = ArtistItem()
            try:
                pass  # do the scrape and assign fields to artist_item
            except Exception:
                pass  # ignore errors for now
            return artist_item

        def parse_album(self, response):
            album_item = AlbumItem()
            try:
                pass  # do the scrape and assign fields to album_item
            except Exception:
                pass  # ignore errors for now
            return album_item
    
Grainier

1 Answer


CrawlSpider calls its _requests_to_follow() method to extract URLs and generate the requests to follow:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        # skip links already matched by an earlier rule
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            # each link yields exactly one request, bound to this rule's callback
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)

As you can see:

  • The variable seen remembers the URLs that have already been processed.
  • Each URL is handled by at most one callback: the first rule whose link extractor matches it.

You can define a single parse_item() that calls both parse_artist() and parse_album():

rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
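
If some of the matched pages contain only artist data or only album data, parse_item() can first check which sections are present before yielding. A minimal sketch, assuming hypothetical div class names for the two sections:

from scrapy.selector import Selector

def parse_item(self, response):
    sel = Selector(response)
    # hypothetical XPaths -- adjust to the real page structure
    if sel.xpath('//div[@class="artist"]'):
        yield self.parse_artist(response)
    if sel.xpath('//div[@class="album"]'):
        yield self.parse_album(response)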
kev
  • `parse_artist(response)` will return an `ArtistItem()`, and `parse_album(response)` will return an `AlbumItem()`. So if I'm using item pipelines (say, in order to persist), is the persisting pipeline going to be called twice, with the two types of data? – Grainier May 16 '14 at 14:02
  • @GrainierPerera pipelines will process every item one by one. – kev May 16 '14 at 14:29
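
To illustrate the last comment: a pipeline's process_item() is called once for every item the spider yields, whichever callback produced it, and it can branch on the item type. A minimal sketch (the pipeline class and its save helpers are placeholders, not from the original thread):

from example.items import ArtistItem, AlbumItem  # hypothetical items module

class ExamplePersistencePipeline(object):

    def process_item(self, item, spider):
        # called once per yielded item
        if isinstance(item, ArtistItem):
            self.save_artist(item)  # placeholder: e.g. insert into an artists table
        elif isinstance(item, AlbumItem):
            self.save_album(item)   # placeholder: e.g. insert into an albums table
        return item

    def save_artist(self, item):
        pass

    def save_album(self, item):
        pass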