101

I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable for every spider.

Thanks

Acorn
CodeMonkeyB
  • Thank you for your very good question. Please select an answer for all future googlers. The answer provided by mstringer worked very well for me. – symbiotech Dec 01 '13 at 17:37

11 Answers

175

Just remove all pipelines from the main settings and use this inside the spider.

This will define the pipeline to use per spider:

class testSpider(InitSpider):
    name = 'test'

    # custom_settings overrides the project-wide ITEM_PIPELINES for this spider only
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
Mirage
  • For anyone (like me) wondering what the '400' is - FROM THE DOC - "The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range" - https://docs.scrapy.org/en/latest/topics/item-pipeline.html – brainLoop Mar 28 '19 at 18:36
  • Not sure why this isn't the accepted answer, works perfectly, much cleaner and simpler than the accepted answer. This is exactly what I was looking for. Still working in scrapy 1.8 – Eric F Nov 27 '19 at 01:36
  • Just checked in scrapy 1.6. It isn't necessary to remove pipeline settings in settings.py. custom_settings in the spider override pipeline settings in settings.py. – Graham Monkman Dec 20 '19 at 14:51
  • Works perfectly for my scenario! – Mark Kamyszek Jan 22 '20 at 02:42
  • for 'app.MyPipeline' replace the full name of the pipeline class. Eg, project.pipelines.MyPipeline where project is the name of the project, pipelines is the pipelines.py file and MyPipeline is the Pipeline class – Nava Bogatee Oct 05 '20 at 11:06
  • I just used this answer in the 2.# version. It works perfectly. It also overrides the pipeline in Settings. – crianopa Jan 22 '21 at 05:30
  • What do you do if you have one Spider that exports multiple different types of items? In my use case I need to crawl several different links on a page and each link has different kinds of data that I need to scrape. – Evan Zamir Sep 12 '21 at 06:14
  • The only issue in my view is difficulty maintaining the project if you have many spiders that all need to use one pipeline definition. To get around this, appropriate pipeline definitions can simply be imported from a central file outside the spiders. I've added an example in my answer. – shawn caza Nov 28 '22 at 14:32
39

Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider for whether or not it should be executed. For example:

import functools

from scrapy import log  # legacy Scrapy logging module; provides log.DEBUG


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper

For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:

# the pipelines module must be importable here, e.g.:
# from myproject import pipelines  # adjust to your project's module path

class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

And then in a pipelines.py file:

# the check_spider_pipeline decorator must be importable into this module, e.g.:
# from myproject.pipeline_utils import check_spider_pipeline  # hypothetical location; wherever you defined it

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item

class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- would be nice to change so that the order could be specified on the Spider, too).
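
For reference, a rough sketch of that settings entry, assuming a recent Scrapy (where ITEM_PIPELINES maps dotted class paths to priorities) and a project module named myproject:

# settings.py ('myproject' is a placeholder for your project module)
ITEM_PIPELINES = {
    'myproject.pipelines.Validate': 100,  # lower number, runs first
    'myproject.pipelines.Save': 200,      # runs after validation
}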

mstringer
  • I am trying to implement your way of switching between pipelines, I'm getting NameError though! I get pipelines is not defined. have you tested this code yourself? would you help me? – mehdix_ Apr 03 '15 at 17:48
  • .@mehdix_ yes, it works for me. Where do you get a NameError? – mstringer Apr 06 '15 at 17:16
  • The error comes right after `scrapy crawl ` command. python does not recognize the names I set within the spider class in order for pipelines to run. I will give you links to my [spider.py](http://pastebin.com/eK5FytEt) and [pipeline.py](http://pastebin.com/RXNX4h8r) for you to take a look. Thanks – mehdix_ Apr 07 '15 at 02:39
  • I think this is a python import issue: you have one Pipeline defined in `pipelines.py` and one named Insert defined in your `spiders.py`: there isn't one called `pipeline.Insert` as you have. I think just changing `pipeline.Insert` to `Insert` when you define the list of pipelines will do the trick. – mstringer Apr 07 '15 at 18:45
  • Its either scrapy that changed its layout design or my imports are not really backward compatible. The interpreter this time doesn't like the `class Save(BasePipeline)`, it throws the nameError again that BasePipeline is not defined. I guess there should be an import on top saying `from scrapy.pipeline import BasePipeline` am I right? – mehdix_ Apr 07 '15 at 21:45
  • Just edited answer to clarify. The important thing is to be sure that your Pipeline objects can be imported. – mstringer Apr 08 '15 at 04:12
  • Thanks for clarification. where does the first code snippet go? somewhere at the end of the `spider.py` right? – mehdix_ Apr 08 '15 at 04:40
  • I edited the condition not to fail on already defined spiders that have no pipeline set, this will also make it execute all pipelines by default unless told otherwise. `if not hasattr(spider, 'pipeline') or self.__class__ in spider.pipeline:` – Nour Wolf Jul 16 '15 at 20:51
  • Hi @mstringer, could you please provide an update on which part of the solution goes exactly where? I'm trying to implement it, but without any success. – default_settings Sep 02 '20 at 15:32
17

The other solutions given here are good, but I think they could be slow, because we are not really using a pipeline per spider; instead, we are checking whether a pipeline applies every time an item is returned (and in some cases this could reach millions of items).

A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler, which work for all extensions of this kind, like this:

pipelines.py

from scrapy.exceptions import NotConfigured

class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
   'myproject.pipelines.SomePipeline': 300,
}
SOMEPIPELINE_ENABLED = True # you could have the pipeline enabled by default

spider1.py

from scrapy import Spider

class Spider1(Spider):

    name = 'spider1'

    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

As you can see, we have specified custom_settings that will override the settings in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.

Now when you run this spider, check for something like:

[scrapy] INFO: Enabled item pipelines: []

Now Scrapy has completely disabled the pipeline, without even instantiating it for the whole run. Note that this also works for Scrapy extensions and middlewares.
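
As a rough sketch, the same pattern applied to a downloader middleware (the class name and the SOMEMIDDLEWARE_ENABLED flag are hypothetical, not part of the answer above):

from scrapy.exceptions import NotConfigured

class SomeDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # same idea as the pipeline: refuse to be enabled when the flag is off
        if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'):
            raise NotConfigured
        return cls()

    def process_request(self, request, spider):
        # returning None lets the request continue through the middleware chain
        return None

It would still be listed in DOWNLOADER_MIDDLEWARES in settings.py, and each spider can flip SOMEMIDDLEWARE_ENABLED in its custom_settings, exactly as with the pipeline.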

Tony
eLRuLL
15

You can use the name attribute of the spider in your pipeline

class CustomPipeline(object):

    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item

Defining all pipelines this way can accomplish what you want.
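
The pipeline still has to be registered in settings.py; a minimal sketch, assuming a project module named myproject:

ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}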

pad
12

I can think of at least four approaches:

  1. Use a different scrapy project per set of spiders+pipelines (might be appropriate if your spiders are different enough to warrant being in different projects)
  2. On the scrapy tool command line, change the pipeline setting with scrapy settings in between each invocation of your spider
  3. Isolate your spiders into their own scrapy tool commands, and define the default_settings['ITEM_PIPELINES'] on your command class to the pipeline list you want for that command. See line 6 of this example, and the sketch after this list.
  4. In the pipeline classes themselves, have process_item() check what spider it's running against, and do nothing if it should be ignored for that spider. See the example using resources per spider to get you started. (This seems like an ugly solution because it tightly couples spiders and item pipelines. You probably shouldn't use this one.)
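
For option 3, a rough sketch of what such a command might look like (the module path, the COMMANDS_MODULE value, and the SaveEpisode pipeline are hypothetical names). Note that a command's default_settings sit at a lower priority than the project's settings.py, so ITEM_PIPELINES would have to be left out of settings.py for this to take effect:

# myproject/commands/crawl_episodes.py
# enabled by COMMANDS_MODULE = 'myproject.commands' in settings.py
from scrapy.commands import crawl

class Command(crawl.Command):
    # defaults applied only when the spider is run through this command
    default_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.SaveEpisode': 100,
        }
    }
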
Francis Avila
  • thanks for your response. I was using method 1 but i feel having one project is cleaner and allows me to reuse code. can you please elaborate more on method 3. How would i isolate spiders into their own tool commands? – CodeMonkeyB Dec 04 '11 at 05:26
  • According to the link posted on another answer, you can't override pipelines so I guess number 3 wouldn't work. – Daniel Bang Aug 12 '12 at 00:49
  • could you help me here please? http://stackoverflow.com/questions/25353650/scrapy-how-to-import-the-settings-to-override-it – Marco Dinatsoli Aug 17 '14 at 21:17
11

The simplest and most effective solution is to set custom_settings in each spider itself:

custom_settings = {'ITEM_PIPELINES': {'project_name.pipelines.SecondPipeline': 300}}

After that, you also need to declare them in the settings.py file:

ITEM_PIPELINES = {
   'project_name.pipelines.FirstPipeline': 300,
   'project_name.pipelines.SecondPipeline': 400
}

This way, each spider will use its respective pipeline.
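
A minimal sketch of how that might look inside a spider (the spider class and name here are hypothetical; the project module is assumed to be project_name, as above):

import scrapy

class SecondSpider(scrapy.Spider):
    name = 'second_spider'

    # overrides the project-wide ITEM_PIPELINES for this spider only
    custom_settings = {
        'ITEM_PIPELINES': {'project_name.pipelines.SecondPipeline': 300}
    }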

default_settings
6

You can just set the item pipeline settings inside the spider, like this:

class CustomSpider(Spider):
    name = 'custom_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.PagePipeline': 400,
            '__main__.ProductPipeline': 300,
        },
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2
    }

I can then split up a pipeline (or even use multiple pipelines) by adding a value to the loader/returned item that identifies which part of the spider sent items over. This way I won’t get any KeyError exceptions and I know which items should be available.

    ...
    def scrape_stuff(self, response):
        # PageLoader/PageItem and ProductLoader/ProductItem below are this
        # project's own item loaders and item classes
        pageloader = PageLoader(
                PageItem(), response=response)

        pageloader.add_xpath('entire_page', '/html//text()')
        pageloader.add_value('item_type', 'page')
        yield pageloader.load_item()

        productloader = ProductLoader(
                ProductItem(), response=response)

        productloader.add_xpath('product_name', '//span[contains(text(), "Example")]')
        productloader.add_value('item_type', 'product')
        yield productloader.load_item()

class PagePipeline:
    def process_item(self, item, spider):
        if item['item_type'] == 'product':
            # do product stuff
            pass

        if item['item_type'] == 'page':
            # do page stuff
            pass

        return item
Ryan Stefan
1

I am using two pipelines: one for image downloads (MyImagesPipeline) and a second to save data in MongoDB (MongoPipeline).

Suppose we have many spiders (spider1, spider2, ...). In my example, spider1 and spider5 cannot use MyImagesPipeline.

settings.py

ITEM_PIPELINES = {
    'scrapycrawler.pipelines.MyImagesPipeline': 1,
    'scrapycrawler.pipelines.MongoPipeline': 2,
}
IMAGES_STORE = '/var/www/scrapycrawler/dowload'

And below is the complete code of the pipelines:

import scrapy
import pymongo
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        if spider.name not in ['spider1', 'spider5']:
            # run the normal image-downloading behaviour
            return super(MyImagesPipeline, self).process_item(item, spider)
        else:
            # spider1 and spider5 skip image downloading entirely
            return item

    def file_path(self, request, response=None, info=None):
        # store images in subfolders based on the first two characters of the file name
        image_name = request.url.split('/')[-1]
        dir1 = image_name[0]
        dir2 = image_name[1]
        return dir1 + '/' + dir2 + '/' + image_name

class MongoPipeline(object):

    collection_name = 'scrapy_items'
    collection_url='snapdeal_urls'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        #self.db[self.collection_name].insert(dict(item))
        collection_name=item.get( 'collection_name', self.collection_name )
        self.db[collection_name].insert(dict(item))
        data = {}
        data['base_id'] = item['base_id']
        self.db[self.collection_url].update({
            'base_id': item['base_id']
        }, {
            '$set': {
                'image_download': 1
            }
        }, upsert=False, multi=True)
        return item
Nanhe Kumar
1

We can use some conditions in the pipeline, like this:

# -*- coding: utf-8 -*-
from scrapy_app.items import x

class SaveItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, x):
            item.save()
        return item
Wade
1

Overriding 'ITEM_PIPELINES' with custom settings per spider, as others have suggested, works well. However, I found I had a few distinct groups of pipelines I wanted to use for different categories of spiders. I wanted to be able to easily define the pipeline for a particular category of spider without a lot of thought, and I wanted to be able to update a pipeline category without editing each spider in that category individually.

So I created a new file called pipeline_definitions.py in the same directory as settings.py. pipeline_definitions.py contains functions like this:

def episode_pipelines():
    return {
        'radio_scrape.pipelines.SaveEpisode': 100,
    }

def show_pipelines():
    return {
        'radio_scrape.pipelines.SaveShow': 100,
    }

Then in each spider I would import the specific function relevant for the spider:

from radio_scrape.pipeline_definitions import episode_pipelines

I then use that function in the custom settings assignment:

import scrapy

class RadioStationAEpisodesSpider(scrapy.Spider):
    name = 'radio_station_A_episodes'
    custom_settings = {
        'ITEM_PIPELINES': episode_pipelines()
    }
shawn caza
0

Simple but still useful solution.

Spider code

    def parse(self, response):
        item = {}
        # ... do parse stuff
        item['info'] = {'spider': 'Spider2'}
        yield item

pipeline code

    def process_item(self, item, spider):
        if item['info']['spider'] == 'Spider1':
            logging.error('Spider1 pipeline works')
        elif item['info']['spider'] == 'Spider2':
            logging.error('Spider2 pipeline works')
        elif item['info']['spider'] == 'Spider3':
            logging.error('Spider3 pipeline works')
        return item

Hope this saves some time for somebody!

NashGC
  • This does not scale very well, and it also makes the code messy by mixing responsibilities together. – godzsa Mar 13 '21 at 16:38