
I want to crawl all the links present in the sitemap.xml of a fixed site. I came across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
Yevhen Kuzmovych
sulav_lfc

2 Answers


Essentially, you could create new Request objects for the URLs discovered by the SitemapSpider and parse the responses with a new callback:

from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
        # dont_filter=True is needed because response.url has already been
        # visited and would otherwise be dropped by the duplicate filter
        yield Request(response.url, callback=self.parse_sitemap_url,
                      dont_filter=True)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass
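For context, the URLs the SitemapSpider feeds into parse come from the sitemap's `<loc>` entries. A minimal standalone sketch of that extraction (plain standard library, no Scrapy; the sample sitemap below is made up for illustration) could be:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# Hypothetical sitemap snippet for demonstration
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.xyz.nl/x/page-1</loc></url>
  <url><loc>http://www.xyz.nl/y/page-2</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
# → ['http://www.xyz.nl/x/page-1', 'http://www.xyz.nl/y/page-2']
```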
Talvalin

You need to add sitemap_rules to process the data from the crawled URLs, and you can create as many rules as you want. For instance, say you have a page at http://www.xyz.nl/x/ that you want to process; you would create a rule like:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    # sitemap_urls must be a list
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback-name) tuples - this example contains one rule;
    # note the callback is given by name, as a string
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        paragraphs = response.xpath('//p').extract()
        return {'paragraphs': paragraphs}
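Each entry in sitemap_rules is a (regex, callback) pair, and Scrapy routes a URL to the first rule whose regex matches it. A quick standalone sketch of that routing (plain `re`, no Scrapy; the second rule and the sample URLs are hypothetical):

```python
import re

# Hypothetical rules mirroring the sitemap_rules above
rules = [("/x/", "parse_x"), ("/y/", "parse_y")]

def route(url, rules):
    """Return the callback name of the first rule whose regex matches the URL."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched; Scrapy would skip such a URL

print(route("http://www.xyz.nl/x/article-1", rules))  # → parse_x
print(route("http://www.xyz.nl/about", rules))        # → None
```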
  • The `parse` method is called by default if no rules are specified; I believe the original post was correct in that regard – Herman Schaaf Jul 24 '15 at 10:59