
Assume I have a scraped item that looks like this:

{
    name: "Foo",
    country: "US",
    url: "http://..."
}

In a pipeline I want to make a GET request to the url and check some response properties like the content type and status code. When these do not meet certain conditions I want to drop the item, like this:

class MyPipeline(object):
    def process_item(self, item, spider):
        # pseudocode: an imaginary callback-style request API
        def on_response(response):
            if ...:  # e.g. wrong content type or status code
                raise DropItem()
            return item

        def on_error(error):
            raise DropItem()

        request(item['url'], on_response, on_error)

Smells like this is not possible using pipelines. What do you think? Any ideas how to achieve this?

The spider:

import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        # the endpoint returns a JSON list of station objects
        jsonResponse = json.loads(response.body_as_unicode())
        for station in jsonResponse:
            yield station
– DarkLeafyGreen

1 Answer


Easy way

import requests

from scrapy.exceptions import DropItem


def process_item(self, item, spider):
    response = requests.get(item['url'])
    if response.status_code ...:
        raise DropItem()
    elif response.text ...:
        raise DropItem()
    else:
        return item
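
For instance, the checks could be filled in like this (HeaderCheckPipeline, the 200 status check and the text/html content type are just placeholder conditions, substitute your own). Keep in mind that requests.get is blocking, so every item stalls the pipeline until its GET finishes:

import requests

from scrapy.exceptions import DropItem


class HeaderCheckPipeline(object):
    def process_item(self, item, spider):
        # blocking GET request to the item's url
        response = requests.get(item['url'], timeout=10)

        if response.status_code != 200:
            raise DropItem('bad status code for %s' % item['url'])

        content_type = response.headers.get('Content-Type', '')
        if 'text/html' not in content_type:
            raise DropItem('unexpected content type for %s' % item['url'])

        return item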

Scrapy way

Now, I don't think you should do this inside a pipeline; you should handle it inside the spider, yielding a request instead of the item and then yielding the item from that request's callback.
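
For example, with the spider from the question it could look roughly like this (the status and Content-Type conditions in check_station are placeholders for whatever rules you actually need):

import scrapy
import json


class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        for station in json.loads(response.body_as_unicode()):
            # fetch the station url first and only yield the item
            # if the response passes the checks
            yield scrapy.Request(
                station['url'],
                callback=self.check_station,
                meta={'station': station},
            )

    def check_station(self, response):
        station = response.meta['station']
        content_type = response.headers.get('Content-Type', b'')
        # placeholder conditions -- adjust them to your own rules
        if response.status == 200 and b'text/html' in content_type:
            yield station
        # otherwise yield nothing, which effectively drops the item

Note that by default Scrapy only passes 2xx responses to callbacks (the HttpError spider middleware filters the rest), so failed requests are already discarded unless you allow their status codes via handle_httpstatus_list or HTTPERROR_ALLOWED_CODES.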

Now, if you still want to issue a Scrapy Request from inside a pipeline, you could do something like this:

from scrapy import Request
from scrapy.exceptions import DropItem


class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        ...
        self.crawler.engine.crawl(
            Request(
                url='someurl',
                callback=self.custom_callback,
            ),
            spider,
        )

        # you have to drop the item, and send it again after your check
        raise DropItem()

    # YES, you can define a method callback inside the same pipeline
    def custom_callback(self, response):
        ...
        # build the checked item here and yield it
        yield item

Note that we are emulating the behaviour of spider callbacks inside the pipeline. You need a way to always drop the items for which you want to do an extra request, and only pass through the ones that come back from the extra callback.

One way could be to send different types of items and check them inside the pipeline's process_item:

def process_item(self, item, spider):
    if isinstance(item, TempItem):
        ...
    elif isinstance(item, FinalItem):
        return item
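
To make that concrete, here is a rough sketch of how the pieces could fit together (TempItem, FinalItem, the meta key and the text/html check are all just assumptions, and the spider would have to yield TempItem objects instead of plain dicts):

import scrapy
from scrapy import Request
from scrapy.exceptions import DropItem


class TempItem(scrapy.Item):
    name = scrapy.Field()
    country = scrapy.Field()
    url = scrapy.Field()


class FinalItem(TempItem):
    pass


class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        if isinstance(item, FinalItem):
            # already checked by custom_callback, let it through
            return item

        # schedule the extra request and drop the temporary item for now
        self.crawler.engine.crawl(
            Request(
                item['url'],
                callback=self.custom_callback,
                meta={'item': item},
            ),
            spider,
        )
        raise DropItem('waiting for the check on %s' % item['url'])

    def custom_callback(self, response):
        content_type = response.headers.get('Content-Type', b'')
        if b'text/html' in content_type:
            # re-send the data as a FinalItem so it passes process_item
            yield FinalItem(response.meta['item'])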
– eLRuLL
  • Can you show some code for doing it the scrapy way inside the spider? It seems to be the correct solution. – DarkLeafyGreen Jul 19 '16 at 22:08
  • you'll have to share the code of your spider (or at least the part where you are yielding the items with those urls you want to check later) – eLRuLL Jul 19 '16 at 22:16
  • you can get the crawler from the spider passed in to process_item, no need to init? – lucid_dreamer Feb 02 '21 at 02:04
  • Note if you need to build an absolute url from the response, you don't have the response available here... I.e. no response.url_join – lucid_dreamer Feb 02 '21 at 02:05