I'm working on a sitemap spider. It takes one sitemap URL and scrapes every URL listed in that sitemap. I want to limit the number of scraped URLs to 100.
I can't use CLOSESPIDER_PAGECOUNT because I use an XML export pipeline. When Scrapy reaches the page count, it seems to stop everything, including the XML generation, so the XML file is never closed properly and ends up invalid.
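For reference, the setting in question is just this (in settings.py, or equivalently in a spider's custom_settings dict):

```python
# settings.py -- the built-in limit I had to abandon: Scrapy stops
# abruptly once 100 pages are crawled, before the XML export pipeline
# can finalize (close) the output file.
CLOSESPIDER_PAGECOUNT = 100
```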
    from scrapy import Request
    from scrapy.spiders import SitemapSpider


    class MainSpider(SitemapSpider):
        name = 'main_spider'
        allowed_domains = ['doman.com']
        sitemap_urls = ['http://doman.com/sitemap.xml']

        def start_requests(self):
            for url in self.sitemap_urls:
                yield Request(url, self._parse_sitemap)

        def parse(self, response):
            print u'URL: {}'.format(response.url)
            if self._is_product(response):
                URL = response.url
                ITEM_ID = self._extract_code(response)
        ...
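One direction I've considered (a sketch only, not tested with Scrapy): keep my own counter and stop once the limit is hit, e.g. by raising scrapy.exceptions.CloseSpider from parse(). Here is the counting pattern in plain Python; limited() is a hypothetical helper, not part of Scrapy:

```python
# Plain-Python sketch of the counting idea. In the spider this would
# become a self.count attribute checked inside parse(), raising
# scrapy.exceptions.CloseSpider('limit reached') after 100 URLs --
# hoping that, unlike CLOSESPIDER_PAGECOUNT, the pipeline still gets
# to close the XML file cleanly.

def limited(urls, limit=100):
    """Yield at most `limit` items from `urls`, then stop."""
    for count, url in enumerate(urls):
        if count >= limit:
            break
        yield url

# Example with fake URLs standing in for scraped pages:
capped = list(limited('http://doman.com/page/%d' % i for i in range(1000)))
```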
Do you know what to do?