
I'm working on a sitemap spider. The spider takes one sitemap URL and scrapes all URLs in that sitemap. I want to limit the number of URLs to 100.

I can't use CLOSESPIDER_PAGECOUNT because I use an XML export pipeline. It seems that when Scrapy reaches the page count, it stops everything, including XML generation, so the XML file is never closed and ends up invalid.
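For reference, a minimal sketch of how that setting is typically enabled per spider (the actual project settings are not shown here):

from scrapy.spiders import SitemapSpider

class MainSpider(SitemapSpider):
    # Stop crawling after ~100 responses. Per the problem described
    # above, this shutdown also cuts off the XML export pipeline
    # before its output file is finalized.
    custom_settings = {'CLOSESPIDER_PAGECOUNT': 100}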

from scrapy import Request
from scrapy.spiders import SitemapSpider

class MainSpider(SitemapSpider):
    name = 'main_spider'
    allowed_domains = ['doman.com']
    sitemap_urls = ['http://doman.com/sitemap.xml']

    def start_requests(self):
        for url in self.sitemap_urls:
            yield Request(url, self._parse_sitemap)

    def parse(self, response):
        print('URL: {}'.format(response.url))
        if self._is_product(response):
            URL = response.url
            ITEM_ID = self._extract_code(response)

    ...

Do you know what to do?

Milano

3 Answers


If you are using SitemapSpider, you can override sitemap_filter, which is the proper way to filter sitemap entries.

from scrapy.spiders import SitemapSpider

class MainSpider(SitemapSpider):
    limit = 5  # Limit to 5 entries only
    count = 0  # Entries counter

    def sitemap_filter(self, entries):
        # Stop yielding entries once the limit is reached
        for entry in entries:
            if self.count >= self.limit:
                break
            self.count += 1
            yield entry
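Entries skipped in sitemap_filter never turn into requests, so the crawl simply runs out of work and finishes normally, which should let an export pipeline close its output file cleanly.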
Mirza Bilal

Using only return (as in the answer below) was not enough for me, but you can combine it with the CloseSpider exception:

# Import it:
from scrapy.exceptions import CloseSpider

# Later, raise it to stop the spider:
raise CloseSpider('message')
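A minimal sketch of how the two pieces can fit together (the limit and count attributes here are illustrative, not taken from the linked code):

from scrapy.spiders import SitemapSpider
from scrapy.exceptions import CloseSpider

class MainSpider(SitemapSpider):
    limit = 100  # illustrative cap on parsed pages
    count = 0

    def parse(self, response):
        self.count += 1
        if self.count > self.limit:
            # CloseSpider asks the engine for a graceful shutdown,
            # so item pipelines still get their close_spider() call.
            raise CloseSpider('reached the URL limit')
        # ... actual parsing here ...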

I posted the whole code combining both approaches on Stack Overflow here.


Why not have a count attribute on the spider, initialized to 0? Then in your parse method you can do:

def parse(self, response):
    if self.count >= 100:
        return
    self.count += 1
    # do actual parsing here
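Note that returning early here only skips your parsing callback; requests that have already been scheduled are still downloaded, which is presumably why the answer above says return alone was not enough.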
omu_negru