
For debugging purposes, I would like to get ONLY the items that have been dropped in Scrapy (the Python library) via raise DropItem.

I want this list because, during the sanitization process, some pages contain HTML errors, and I would like to add those URLs to my spider's blacklist.

– Émerson Felinto
  • DropItem is logged by Scrapy, according to the manual, so you can try playing with log formatters and filters. See this question for some examples (there the exception is set to DEBUG level; you can pick any other level - maybe an unnamed one like 15, to make it easy to find - and then filter it in your log): https://stackoverflow.com/questions/13527921/scrapy-silently-drop-an-item – STerliakov Apr 01 '21 at 16:52
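
Expanding on that comment, here is a minimal sketch of the log-formatter approach (the class name, message text, and the unnamed level 15 are illustrative choices; LogFormatter.dropped itself is Scrapy's documented hook):

import logging
from scrapy import logformatter

class DroppedItemFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        # Re-emit dropped-item events at a custom, unnamed level (15)
        # so they stand out and can be filtered from the log afterwards.
        return {
            'level': 15,
            'msg': "Dropped: %(exception)s %(item)s",
            'args': {'exception': exception, 'item': item},
        }

Point Scrapy at it with LOG_FORMATTER = 'myproject.formatters.DroppedItemFormatter' in settings.py (module path assumed).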

1 Answer


Listen for the item_dropped signal:

import scrapy
import scrapy.signals
from scrapy.crawler import CrawlerProcess


class Spider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url}


def item_dropped(item, response, exception, spider):
    # Called once per item rejected with DropItem; `exception` is the
    # DropItem instance, so its message tells you why the item was dropped.
    print(item, exception)


process = CrawlerProcess()
process.crawl(Spider)
for crawler in process.crawlers:
    crawler.signals.connect(item_dropped, signal=scrapy.signals.item_dropped)

process.start()
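
Note that item_dropped only fires when an item pipeline raises DropItem, which the spider above never does on its own. A minimal pipeline to exercise the handler could look like this (DropAllPipeline and the inline ITEM_PIPELINES setting are illustrative additions, not part of the original answer):

from scrapy.exceptions import DropItem

class DropAllPipeline:
    # Rejects every item, so item_dropped fires for each one.
    def process_item(self, item, spider):
        raise DropItem(f"HTML error on {item['url']}")

# Enable it when building the process:
# process = CrawlerProcess(settings={
#     'ITEM_PIPELINES': {'__main__.DropAllPipeline': 100},
# })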
– Steven Almeroth