I'm trying to use this code to get the list of URLs in a sitemap. When I run it, I see no results on the screen. Could anyone tell me what the problem is, or suggest a better approach with a good example? Thanks in advance.

from scrapy.spiders import SitemapSpider
from scrapy.http import Request

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
        return Request(response.url, callback=self.parse_sitemap_url)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass
Shiv18
  • Simply because your code is actually doing nothing but calling the parse_sitemap_url() function, which does nothing. Also, your class MySpider is not well formatted and has unused class vars. Where did you get that code from? – hexerei software Jan 24 '16 at 19:01
  • I was referring to this link, actually: http://stackoverflow.com/questions/22957267/scrapy-crawl-all-sitemap-links – Shiv18 Jan 24 '16 at 19:10
  • Could you help me resolve this, or tell me how to do it? – Shiv18 Jan 24 '16 at 19:10
  • That would be a bit too complex for my spare time right now, since it would be a real job. Usually I like to help by pointing out errors in code or giving suggestions, but the code above is not even close to finished. It is a class that you can use in your own code, but you have no main(), no starting point, nothing; just simple prototype code for a class design :( – hexerei software Jan 24 '16 at 21:13
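
Picking up on the first comment above, a minimal working sketch of the question's spider could look like the following. SitemapSpider already requests every URL listed in the sitemap and passes each response to parse() (via its default sitemap_rules), so re-requesting response.url as the question does only creates requests that Scrapy's duplicate filter drops. The domain is the placeholder from the question.

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    # SitemapSpider fetches sitemap_urls, expands any sitemap index,
    # and calls parse() once per URL listed in the sitemap
    def parse(self, response):
        print response.url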

1 Answer

This spider will get all the URLs from a sitemap and save them to a list. You can easily change it to output to a file or the console.

# -*- coding: utf-8 -*-
from scrapy.spiders import SitemapSpider
from scrapy.http import Request
from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots
import requests


class GetpagesfromsitemapSpider(SitemapSpider):
    name = "test"
    handle_httpstatus_list = [404]

    def __init__(self, spider=None, *a, **kw):
        super(GetpagesfromsitemapSpider, self).__init__(*a, **kw)
        self.spider = spider
        # Find a sitemap for the site up front: prefer /sitemap.xml,
        # and fall back to /robots.txt if the sitemap does not exist
        l = []
        url = "https://channelstore.roku.com"
        resp = requests.head(url + "/sitemap.xml")
        if resp.status_code != 404:
            l.append(resp.url)
        else:
            resp = requests.head(url + "/robots.txt")
            if resp.status_code == 200:
                l.append(resp.url)
        self.sitemap_urls = l
        print self.sitemap_urls

    def parse(self, response):
        print response.url

    def _parse_sitemap(self, response):
        if response.url.endswith('/robots.txt'):
            for url in sitemap_urls_from_robots(response.body):
                yield Request(url, callback=self._parse_sitemap)
        else:
            body = self._get_sitemap_body(response)
            if body is None:
                self.logger.info('Ignoring invalid sitemap: %s', response.url)
                return

            s = Sitemap(body)
            sites = []
            if s.type == 'sitemapindex':
                # A sitemap index lists further sitemaps; follow each of them
                for loc in iterloc(s, self.sitemap_alternate_links):
                    if any(x.search(loc) for x in self._follow):
                        yield Request(loc, callback=self._parse_sitemap)
            elif s.type == 'urlset':
                # A urlset lists the actual page URLs; collect every one that
                # matches the spider's rules (self._cbs and self._follow are
                # set up by SitemapSpider.__init__ from sitemap_rules)
                for loc in iterloc(s):
                    for r, c in self._cbs:
                        if r.search(loc):
                            sites.append(loc)
                            break
            print sites


def iterloc(it, alt=False):
    for d in it:
        yield d['loc']

        # Also consider alternate URLs (xhtml:link rel="alternate")
        if alt and 'alternate' in d:
            for l in d['alternate']:
                yield l
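
As noted in the comments below, the spider is meant to live inside a Scrapy project and be run with scrapy crawl test. If you want a self-contained starting point instead, a minimal sketch using Scrapy's CrawlerProcess would be (the LOG_LEVEL setting is just a suggestion to keep the print output visible):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})  # quiet Scrapy's own logging
process.crawl(GetpagesfromsitemapSpider)
process.start()  # blocks until the crawl finishes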
dataisbeautiful
  • Thanks a ton for your help, but I want those URLs added to a tuple or list variable. Could you extend your help in this regard, please? – Shiv18 Jan 25 '16 at 06:14
  • Sorry for bothering you again, but when I run this program I get an IndentationError at the line "self.spider = spider". When I gave one more indent to the line above it ("super(GetpagesfromsitemapSpider, self).__init__(*a, **kw)"), it didn't throw an error, but I see no results on the screen. I tried redirecting the result to a txt file; it's still empty. Could you tell me the possible reason for it? I run this program on Linux with Python 2.7. @dataisbeautiful – Shiv18 Jan 25 '16 at 10:50
  • I've replaced the code above with a tested version now! :) – dataisbeautiful Jan 25 '16 at 14:42
  • I think you're nearly done, but with small logical errors. When I debugged, I added a line instantiating the class and ran it, and the output I see is just the input I gave. When I inspect the code again, I see no calling statement for functions like parse and _parse_sitemap. Could you recheck and tell me the correct code? Thanks in advance. @dataisbeautiful – Shiv18 Jan 26 '16 at 10:59
  • You need to add this spider to an existing Scrapy project and then call it using scrapy crawl test. The code above works and returns all URLs from the sitemap on roku.com. – dataisbeautiful Jan 26 '16 at 11:03
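
Following up on the first comment above, which asks for the URLs in a list variable: one possible approach (a sketch, not part of the answer itself) is to accumulate them on the spider instance and write them out when the spider closes. The all_sites attribute and the output filename are hypothetical names chosen here for illustration:

# In __init__, add a hypothetical accumulator attribute:
#     self.all_sites = []
# In _parse_sitemap, replace `print sites` with:
#     self.all_sites.extend(sites)

# Scrapy calls closed() automatically when the spider finishes
def closed(self, reason):
    with open('sitemap_urls.txt', 'w') as f:  # hypothetical output file
        for url in self.all_sites:
            f.write(url + '\n')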