
I need some advice on how to proceed with my item pipeline. I need to POST an item to an API (working well), get the ID of the created entity from the response object (have this working too), and then use it to populate another entity. Ideally, the item pipeline can return the entity ID. Basically, I have a one-to-many relationship that I need to encode in a NoSQL database. What would be the best way to proceed?
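To make the flow concrete, here is a stripped-down sketch of what the pipeline does (the names, the URL, and the 'entity_id' field are placeholders, not my actual code):

import requests

class CreateEntityPipeline(object):

    def process_item(self, item, spider):
        # POST the item to the API (this part works)
        response = requests.post('http://example.com/api/entities', json=dict(item))
        # the response contains the ID of the entity that was just created
        # (assumes the item declares an 'entity_id' field)
        item['entity_id'] = response.json()['id']
        # ...and this ID is what I now need in order to populate the related entities
        return item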

MoreScratch
  • I'm not sure I understand your question. "Ideally, the item pipeline can return the entity ID" If you just need to return an entity ID with your item from the item pipeline, why not just add an 'entity_meta' attribute to your item and populate it before returning from the item pipeline? Then in a l – rocktheartsm4l Aug 07 '14 at 23:07

2 Answers


Perhaps I don't understand your question, but it sounds like you just need to call your submission code in the close_spider(self, spider) method of your pipeline. Have you tried that?
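Something like this rough, untested sketch (I'm using requests and a placeholder URL here; swap in your own submission code):

import requests

class SubmitOnClosePipeline(object):

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # just buffer the items while the spider runs
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # do the one-off submission work once scraping has finished
        for item in self.items:
            requests.post('http://example.com/api/items', json=item)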

Bee Smears
  • Not sure that would work. I am scraping a page that has a list of dealerships. Each dealership has its own page that lists all the cars it has in stock. Each car has its own page with a list of options (A/C, transmission, etc...). I have to populate three tables: 1) dealers; 2) brands (for each dealer); and 3) available cars (for each brand at a given dealer). Right now I have 3 corresponding items, each with their own pipeline. Can your approach still be used? – MoreScratch Aug 07 '14 at 02:24
  • @MoreScratch this is **not** the right answer for everyone but, given my limited need/use of Scrapy, I have merely handled the additional post-scraping processes (normalizing, batching, and submitting data, for instance) separate from the framework. Thus, I simply import the modules I need to complete the post-scraping processes into pipelines.py and then call the initial script used in that process. I only rely on scrapy for its encoding, requests, and responses. Everything else is my own python/xpath/mysql/etc. – Bee Smears Aug 07 '14 at 14:11

The best way for you to proceed is to use MongoDB, a NoSQL database that works well with Scrapy. The pipeline for MongoDB can be found here, and the process is explained in this tutorial.
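A rough, untested sketch of such a pipeline with pymongo (the connection settings, the database/collection names, and the 'entity_id' field are my own assumptions, not part of the linked tutorial):

import pymongo

class MongoDBPipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['scraped_data']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one returns the generated _id, which you can copy onto the
        # item and reuse in a later pipeline for the "many" side of the
        # relationship (assumes the item declares an 'entity_id' field)
        result = self.db['entities'].insert_one(dict(item))
        item['entity_id'] = str(result.inserted_id)
        return item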

Now, as explained in the solution from Pablo Hoffman, updating different items from different pipelines into one can be achieved with the following decorator on the process_item method of a pipeline object, so that it checks the spider's pipeline attribute to decide whether or not it should be executed. (I have not tested the code, but I hope it helps.)

import functools

from scrapy import log

def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper

And the decorator is used in the spider and pipeline like this:

class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

class Save(object):  # a plain class; Scrapy pipelines need no base class

    @check_spider_pipeline
    def process_item(self, item, spider):
        # more scrapy goodness here
        return item 
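Note that with this approach every pipeline class still has to be enabled in ITEM_PIPELINES in settings.py; the spider's pipeline set only decides which of them actually process its items, while the others just pass the item through untouched.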

Finally, you can also take a look at this question.

Tushar Gupta