Separate output file for every url given in start_urls list of spider in scrapy

Question

I want to create separate output file for every url I have set in start_urls of spider or somehow want to split ouput files start url wise.

Following is the start_urls of my spider

start_urls = ['http://www.dmoz.org/Arts/', 'http://www.dmoz.org/Business/', 'http://www.dmoz.org/Computers/']

I want to create separate output file like

Arts.xml
Business.xml
Computers.xml

I don't know exactly how to do this. I am thinking to achieve this by implementing some thing like following in spider_opened method of item pipeline class,

import re
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class CleanDataPipeline(object):
    def __init__(self):
        self.cnt = 0
        self.filename = ''

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        referer_url = response.request.headers.get('referer', None)
        if referer_url in spider.start_urls:
            catname = re.search(r'/(.*)$', referer_url, re.I)
            self.filename = catname.group(1)

        file = open('output/' + str(self.cnt) + '_' + self.filename + '.xml', 'w+b')
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        #file.close()

    def process_item(self, item, spider):
        self.cnt = self.cnt + 1
        self.spider_closed(spider)
        self.spider_opened(spider)
        self.exporter.export_item(item)
        return item

Where I am trying to find the referer url of every scraped item within the start_urls list. If referer url is found in start_urls then file name will be created using that referer url. But problem is how to access response object inside spider_opened() method. If I can access it there, I can create file based on that.

Any help to find a way to perform this? Thanks in advance!

score 6 · Accepted Answer · answered May 26 '14 at 17:17

I'd implement a more explicit approach (not tested):

configure list of possible categories in settings.py:
```
CATEGORIES = ['Arts', 'Business', 'Computers']
```

define your start_urls based on the setting

start_urls = ['http://www.dmoz.org/%s' % category for category in settings.CATEGORIES]

add category Field to the Item class

in the spider's parse method set the category field according to the current response.url, e.g.:

def parse(self, response):
     ...
     item['category'] = next(category for category in settings.CATEGORIES if category in response.url)
     ...

in the pipeline open up exporters for all categories and choose which exporter to use based on the item['category']:

def spider_opened(self, spider):
    ...
    self.exporters = {}
    for category in settings.CATEGORIES:
        file = open('output/%s.xml' % category, 'w+b')
        exporter = XmlItemExporter(file)
        exporter.start_exporting()
        self.exporters[category] = exporter

def spider_closed(self, spider):
    for exporter in self.exporters.itervalues(): 
        exporter.finish_exporting()

def process_item(self, item, spider):
    self.exporters[item['category']].export_item(item)
    return item

You would probably need to tweak it a bit to make it work but I hope you got the idea - store the category inside the item being processed. Choose a file to export to based on the item category value.

Hope that helps.

score 2 · Answer 2 · answered May 26 '14 at 17:13

As long as you don't store it in the item itself, you can't really know the staring url. The following solution should work for you:

redefine the make_request_from_url to send the starting url with each Request you make. You can store it in meta attribute of your Request. Bypass this starting url with each following Request.
as soon as you decide to pass the element to pipeline, fill in the starting url for the item from response.meta['start_url']

Hope it helps. Following links may be helpful:

http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.make_requests_from_url

http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=meta#passing-additional-data-to-callback-functions

score 0 · Answer 3 · answered Feb 10 '20 at 11:36

Here's how I did it for my project without setting the category in item:

Pass the argument from the command line like this:

scrapy crawl reviews_spider -a brand_name=apple

Receive the argument and set to spider args in the my_spider.py

def __init__(self, brand_name, *args, **kwargs):
    self.brand_name = brand_name
    super(ReviewsSpider, self).__init__(*args, **kwargs)

    # i am reading start_urls from an external file depending on the passed argument
    with open('make_urls.json') as f:
        self.start_urls = json.loads(f.read())[self.brand_name]

In pipelines.py:

class ReviewSummaryItemPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # change the output file name based on argument
        self.file = open(f'reviews_summary_{spider.brand_name}.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Separate output file for every url given in start_urls list of spider in scrapy

3 Answers3

Linked