I'm trying to create an ExampleSpider that extends Scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages that contain both album and artist info.
I was able to handle the first two scenarios, but the problem occurs in the third. I'm using the parse_artist(response) method to process artist data and the parse_album(response) method to process album data.
My question is: if a page contains both artist and album data, how should I define my rules?
- Should I do it like below? (Two rules for the same URL pattern)
- Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)
Is there another, proper way to do it?

    # Import paths as used by the older Scrapy releases that ship SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    # ArtistItem and AlbumItem are defined in my project's items module


    class ExampleSpider(CrawlSpider):
        name = 'example'
        start_urls = ['http://www.example.com']

        rules = [
            # Two rules with the same URL pattern, one per callback
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
            # more rules .....
        ]

        def parse_artist(self, response):
            artist_item = ArtistItem()
            try:
                # do the scrape and assign to ArtistItem
                pass
            except Exception:
                # ignore for now
                pass
            return artist_item

        def parse_album(self, response):
            album_item = AlbumItem()
            try:
                # do the scrape and assign to AlbumItem
                pass
            except Exception:
                # ignore for now
                pass
            return album_item
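
One alternative I am considering, since as far as I understand CrawlSpider only uses the first rule that matches a link: keep a single rule for the shared pattern and point it at one combined callback that runs both parsers and yields whatever they produce. Below is a minimal plain-Python sketch of that idea. The name parse_artist_and_album, the FakeResponse class, and the dict-based item classes are hypothetical stand-ins so the snippet runs without Scrapy; in the real spider the rule would use callback='parse_artist_and_album', and the two parsers would return None when their data is absent.

```python
# Hypothetical stand-ins so this sketch runs without Scrapy installed.
class ArtistItem(dict):
    pass


class AlbumItem(dict):
    pass


class FakeResponse:
    def __init__(self, url, data):
        self.url = url
        self.data = data  # pretend this is what the selectors would extract


class ExampleCallbacks:
    def parse_artist(self, response):
        # Return an ArtistItem, or None when the page has no artist data.
        name = response.data.get('artist')
        return ArtistItem(name=name, url=response.url) if name else None

    def parse_album(self, response):
        # Return an AlbumItem, or None when the page has no album data.
        title = response.data.get('album')
        return AlbumItem(title=title, url=response.url) if title else None

    def parse_artist_and_album(self, response):
        # Single callback for the shared URL pattern: run both parsers
        # and yield only the items that were actually found.
        for item in (self.parse_artist(response), self.parse_album(response)):
            if item is not None:
                yield item


callbacks = ExampleCallbacks()
both = FakeResponse('http://www.example.com/page', {'artist': 'A', 'album': 'B'})
artist_only = FakeResponse('http://www.example.com/artist', {'artist': 'A'})

items_both = list(callbacks.parse_artist_and_album(both))
items_artist_only = list(callbacks.parse_artist_and_album(artist_only))
```

With this shape, the artist-only and album-only pages could keep their own rules and callbacks, and only the mixed pages would go through the combined one. I have not confirmed that this is the recommended way.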