
I have two spiders in a Scrapy project. They work just fine and produce the required output items.

I want to execute these spiders in a background job in a web application.

Everything is set up: a Flask app with a background job using Redis, and a frontend that waits for the results. All is well.

Except I can't seem to work out how to get the resulting items from the spiders when they execute.

The closest I've come seems to be the answer to this question:

Get Scrapy crawler output/results in script file function

but it seems to refer to an older version of Scrapy (I'm using 1.4.0), and I get this deprecation warning:

'ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762'

Checking that GitHub issue suggests this wouldn't have worked since around v1.1.0.

So, can anyone tell me how to do this now?

freeloader
  • In the GitHub issue referred to, it's shown how to use the `from_crawler` class method to connect signals in newer versions of Scrapy (a rough sketch follows these comments). – Tomáš Linhart Jun 22 '17 at 06:22
  • Yes, that's right, but I'm calling the `CrawlerRunner` from within the background job. If I place the example `@classmethod` there, it's not going to get called by the Scrapy framework. – freeloader Jun 22 '17 at 07:34
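
For context, the `from_crawler` approach mentioned in the first comment might look roughly like this. The class and module names are hypothetical, and, as the follow-up comment points out, Scrapy only calls `from_crawler` on components it instantiates itself, so the class would have to be registered (e.g. via the `EXTENSIONS` setting):

```python
# A rough, hypothetical sketch of the from_crawler approach, assuming
# Scrapy >= 1.0. Class and module paths are made up for illustration.
from scrapy import signals


class ItemCollectorExtension(object):
    """Collects every scraped item in memory via the item_scraped signal."""

    def __init__(self):
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this only when it builds the component itself,
        # which requires registering it, e.g. in settings.py:
        #   EXTENSIONS = {'myproject.extensions.ItemCollectorExtension': 500}
        ext = cls()
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        self.items.append(item)
```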

1 Answer


Turns out it's pretty easy - must have been too late at night for me.

Replace

`from scrapy.xlib.pydispatch import dispatcher`

with

`from pydispatch import dispatcher`

as it clearly says in the deprecation warning:

'otherwise import pydispatch directly if needed.'
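
For reference, a minimal sketch of how the corrected import can be used to collect items from a script. The spider names are placeholders, and `CrawlerProcess` is used here to keep the example self-contained:

```python
# A minimal sketch of collecting items from a script, assuming Scrapy 1.4
# with the PyDispatcher package installed (it ships as a Scrapy dependency).
# Spider names ('spider_one', 'spider_two') are placeholders.
from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spiders():
    results = []

    def collect_item(item, response, spider):
        # Called once for every item any running spider yields.
        results.append(item)

    # Connect the handler before the crawl starts.
    dispatcher.connect(collect_item, signal=signals.item_scraped)

    # get_project_settings() expects to run inside the Scrapy project
    # (or with SCRAPY_SETTINGS_MODULE set).
    process = CrawlerProcess(get_project_settings())
    process.crawl('spider_one')
    process.crawl('spider_two')
    process.start()  # blocks until both crawls finish

    return results
```

In a Redis-backed job queue such as RQ, each job typically runs in a fresh worker process, so starting the Twisted reactor once per job with `CrawlerProcess.start()` should be fine; with `CrawlerRunner` you'd have to manage the reactor yourself.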

freeloader