When you start a Scrapy project, you get a directory tree like this:
$ scrapy startproject multipipeline
$ tree
.
├── multipipeline
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       └── __init__.py
└── scrapy.cfg
The generated pipelines.py looks like this:
$ cat multipipeline/pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class MultipipelinePipeline(object):
    def process_item(self, item, spider):
        return item
However, a Scrapy project can reference any importable Python class as an item pipeline. One option is to convert the generated single-file pipelines module into a package: a directory of its own that contains submodules.
Notice the __init__.py file inside the pipelines/ directory:
$ tree
.
├── multipipeline
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines
│   │   ├── __init__.py
│   │   ├── one.py
│   │   ├── three.py
│   │   └── two.py
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       └── __init__.py
└── scrapy.cfg
The individual modules inside the pipelines/ directory could look like this:
$ cat multipipeline/pipelines/two.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging

logger = logging.getLogger(__name__)


class MyPipelineTwo(object):
    def process_item(self, item, spider):
        logger.debug(self.__class__.__name__)
        return item
You can read more about packages in the Python tutorial's section on packages, which says:

"The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path. In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable, described later."
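
To see the package mechanism in isolation, here is a minimal sketch that builds a throwaway package in a temporary directory and resolves a class from a dotted path. The package name demo_pipelines is hypothetical and exists only for this illustration, and the path-splitting step is a plain-Python approximation of resolving a 'package.module.ClassName' string, not Scrapy's own loader.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk that mirrors the pipelines/ layout.
# "demo_pipelines" is a hypothetical name used only for this illustration.
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_pipelines"
package_dir.mkdir()

# An empty __init__.py marks the directory as a package.
(package_dir / "__init__.py").write_text("")

# One submodule holding one pipeline class, like pipelines/two.py.
(package_dir / "two.py").write_text(
    "class MyPipelineTwo(object):\n"
    "    def process_item(self, item, spider):\n"
    "        return item\n"
)

# Make the temporary directory importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Split the dotted path into a module path and a class name, then import.
dotted_path = "demo_pipelines.two.MyPipelineTwo"
module_path, class_name = dotted_path.rsplit(".", 1)
module = importlib.import_module(module_path)
pipeline_class = getattr(module, class_name)

# Instantiate the class and pass an item through it unchanged.
pipeline = pipeline_class()
result = pipeline.process_item({"title": "example"}, spider=None)
print(pipeline_class.__name__, result)
```

The same dotted-path shape (package, submodule, class) is what the pipeline entries in settings.py use.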
Your settings.py would then contain something like this:
ITEM_PIPELINES = {
    'multipipeline.pipelines.one.MyPipelineOne': 100,
    'multipipeline.pipelines.two.MyPipelineTwo': 200,
    'multipipeline.pipelines.three.MyPipelineThree': 300,
}
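
The integers in ITEM_PIPELINES decide the order: items pass through the pipelines from the lowest value to the highest, and the values are customarily kept in the 0-1000 range. The sketch below imitates that ordering in plain Python so it runs without Scrapy. The stand-in classes, the seen_by key, and the run_pipelines helper are all hypothetical; this is not how Scrapy is implemented internally.

```python
class MyPipelineOne(object):
    def process_item(self, item, spider):
        # Record the visit so the final order can be inspected.
        item.setdefault("seen_by", []).append("one")
        return item


class MyPipelineTwo(object):
    def process_item(self, item, spider):
        item.setdefault("seen_by", []).append("two")
        return item


class MyPipelineThree(object):
    def process_item(self, item, spider):
        item.setdefault("seen_by", []).append("three")
        return item


# Stand-in for the setting: classes instead of dotted-path strings,
# deliberately listed out of order to show that only the numbers matter.
ITEM_PIPELINES = {
    MyPipelineThree: 300,
    MyPipelineOne: 100,
    MyPipelineTwo: 200,
}


def run_pipelines(item, pipelines, spider=None):
    """Hypothetical helper: run the item through each pipeline, lowest value first."""
    ordered = sorted(pipelines, key=pipelines.get)
    for pipeline_class in ordered:
        item = pipeline_class().process_item(item, spider)
    return item


processed = run_pipelines({"title": "example"}, ITEM_PIPELINES)
print(processed["seen_by"])  # ['one', 'two', 'three']
```

Leaving gaps between the values (100, 200, 300) makes it easy to slot another pipeline in between later without renumbering the rest.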