10

My Scrapy script seems to work just fine when I run it in 'one-off' scenarios from the command line, but if I try running the code twice in the same Python session I get this error:

"ReactorNotRestartable"

Why?

The offending code (last line throws the error):

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)

# start engine scrapy/twisted
crawler.start()
Trindaz

4 Answers

11

Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start function, but also a stop function. This stop function takes care of cleaning up the internals of the crawl so that the system ends up in a state from which it can start again.

So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.

Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); stop() just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. This way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits.)

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Process, Queue

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        # install() registers the crawler globally (old Scrapy API); only do
        # it once per process.
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every scraped item via the item_passed signal.
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # start() blocks until the crawl finishes; stop() shuts the crawler
        # down cleanly before the worker process exits.
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass # Do something with item
jro
  • adding crawler.stop() immediately after crawler.start() didn't help - how do I discover the "appropriate time"? – Trindaz Nov 03 '11 at 22:42
  • @Trindaz: I wasn't correct on that call, please see the updated answer. – jro Nov 04 '11 at 10:02
  • Thanks for the update @jro. I've seen this snippet before too and, if I've interpreted it correctly, the concept is that you can scrape as much as you want by adding spiders to a crawler that never dies, rather than trying to restart a crawler for every attempt you make at "executing" a spider. I've marked this as the solution because it technically solves my problem, but it's unusable for me because I don't want to rely on persistent crawler objects in the Django application I'm using this in. I ended up writing a solution based purely on BeautifulSoup and urllib2. – Trindaz Nov 05 '11 at 22:29
  • I'd guess you found it here... http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/ – John Mee Jun 05 '12 at 07:30
  • Will this still run items through the pipelines defined in the settings? – Sam Stoelinga Oct 04 '12 at 09:31
  • Hey, with this I'm getting an error with scrapy.conf, it seems deprecated? – KJW Apr 08 '13 at 19:41
1

crawler.start() starts the Twisted reactor. There can be only one reactor per process, and it cannot be restarted once it has stopped.

If you want to run more spiders, use:

another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
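
To make that concrete, here is a minimal sketch of the pattern this answer describes, reusing the question's settings and spiders and assuming the same old queue-based CrawlerProcess API: queue every spider up front, then start the single reactor once.

# Queue all spiders first, then start the (one and only) reactor once.
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

crawler.queue.append_spider(MySpider())
crawler.queue.append_spider(MyAnotherSpider())

# A single start() call crawls every queued spider; the reactor is never restarted.
crawler.start()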
warvariuc
  • scrapy 0.14 does not support multiple spiders in a crawlerprocess anymore. – goh Dec 27 '11 at 03:17
  • haven't tested, but this might work (from looking at the source code): `crawler.engine.open_spider(another_spider)` – warvariuc Dec 27 '11 at 06:49
  • why would you want to stop reactor? – warvariuc Dec 28 '11 at 18:15
  • sending a ctrl-c interrupt signal doesn't close the spiders – goh Dec 29 '11 at 02:09
  • Yeah it did, but I also ran into some problems handling the spider_opened and spider_closed signals in my pipeline. I don't know; http://tinyurl.com/cpg55xp says it might need to configure the reactor? – goh Dec 29 '11 at 10:00
0

I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.

Thread(target=process.start).start()

Here is the detailed explanation: Run a Scrapy spider in a Celery Task
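
A minimal sketch of this threaded pattern, assuming a newer Scrapy where CrawlerProcess.crawl() accepts the spider class directly (MySpider is the spider from the question; depending on the Scrapy version you may also need to stop Scrapy from installing signal handlers when starting outside the main thread):

from threading import Thread

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)

# process.start() blocks while the Twisted reactor runs, so run it in a
# separate thread to keep the caller (e.g. a Celery task) responsive.
Thread(target=process.start).start()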

Community
-1

It seems to me that you cannot call crawler.start() twice: you may have to re-create the crawler if you want it to run a second time.
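
One way to "re-create it" reliably, borrowing the multiprocessing idea from the accepted answer, is sketched below: run each crawl in its own short-lived process so every run gets a fresh reactor (this assumes a newer Scrapy where CrawlerProcess.crawl() accepts the spider class; MySpider is the spider from the question).

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider():
    # Runs entirely inside a child process, which owns its own reactor.
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes

# Each call spawns a brand-new process, so the reactor restriction never bites.
p = Process(target=run_spider)
p.start()
p.join()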

Joël