Assume I have a scraped item that looks like this:
{
    "name": "Foo",
    "country": "US",
    "url": "http://..."
}
In a pipeline I want to make a GET request to the item's url and check some response properties, such as the Content-Type header and the status code. When they do not meet certain conditions I want to drop the item, along these lines:
class MyPipeline(object):
    def process_item(self, item, spider):
        # Pseudocode: fetch item['url'] and inspect the response
        def on_response(response):
            if ...:  # e.g. unexpected Content-Type or status
                raise DropItem()
            return item

        def on_error(error):
            raise DropItem()

        request(item['url'], on_response, on_error)
It smells like this is not possible using pipelines alone. What do you think? Any ideas on how to achieve this?
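For illustration, one blocking workaround I can imagine uses the third-party requests library inside the pipeline (my assumption, not Scrapy machinery; it stalls Scrapy's Twisted reactor on every item, so treat it as a proof of concept rather than a recommended implementation, and the text/html check is just a placeholder condition):

import requests
from scrapy.exceptions import DropItem

class MyPipeline(object):
    def process_item(self, item, spider):
        try:
            # Blocking GET; the whole crawler waits while this runs.
            response = requests.get(item['url'], timeout=10)
        except requests.RequestException:
            raise DropItem("Request to %s failed" % item['url'])

        content_type = response.headers.get('Content-Type', '')
        # Placeholder conditions; substitute whatever checks you need.
        if response.status_code != 200 or not content_type.startswith('text/html'):
            raise DropItem("Bad status or Content-Type for %s" % item['url'])

        return item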
The spider:
import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        jsonResponse = json.loads(response.body_as_unicode())
        for station in jsonResponse:
            yield station
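An alternative that stays inside Scrapy's async model would be to validate each URL in the spider itself, before the item ever reaches a pipeline, by chaining a second request and carrying the item along in meta. This is a sketch under my own assumptions; check_station, drop_station, and the header/status conditions are hypothetical names, not anything from the original code:

import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        for station in json.loads(response.body_as_unicode()):
            # Fetch the station URL first; the item travels along in meta.
            yield scrapy.Request(
                station['url'],
                callback=self.check_station,  # hypothetical helper
                errback=self.drop_station,    # hypothetical helper
                meta={'station': station},
            )

    def check_station(self, response):
        content_type = response.headers.get('Content-Type', b'')
        # Example conditions; substitute whatever checks you need.
        if response.status == 200 and content_type.startswith(b'text/html'):
            yield response.meta['station']
        # Otherwise the item is simply never yielded, i.e. dropped.

    def drop_station(self, failure):
        # The request failed; dropping the item means not yielding it.
        pass

The upside is that nothing blocks and dropping is free (an item that is never yielded never exists); the downside is that the validation logic lives in the spider rather than in a reusable pipeline.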