29

with:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I've always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start() 

but since I've moved this code into a web_crawler(self) function, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2) 

and started calling the method using class instantiation, like:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I am getting the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

What is wrong?


6 Answers

65

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process:

import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

configure_logging()

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
  • This solution works. Tested it with Jupyter (Google Colab). [⚠️BEWARE⚠️] There is one BIG caveat: you MUST restart your runtime when using this for the first time. Otherwise the bloated corpse of your previous reactor is still lingering around, and your forked processes will carry it over as well. After that, everything will run smoothly because the parent process will not touch its own reactor anymore. – Domi Feb 07 '19 at 12:50
  • Thanks, it works for me too. By the way, can you help me capture the result? I'm stuck trying to get the result. – Budi Mulyo Mar 09 '19 at 08:38
  • Sorry, I think it's because of my code: `def parse(self, response):` and `def after_login(self, response):` – Budi Mulyo Mar 12 '19 at 02:36
  • Nope, it's still hard to capture in a variable. XD – Budi Mulyo Mar 13 '19 at 10:08
  • I get an `AttributeError: 'PyDB' object has no attribute 'has_plugin_line_breaks'` and an `Exception ignored in: '_pydevd_frame_eval.pydevd_frame_evaluator_darwin_37_64.get_bytecode_while_frame_eval'`, but it still works – PlsWork May 19 '19 at 00:58
  • I get an error when trying to run the code above: `AttributeError: Can't pickle local object 'run_spider.<locals>.f'` – Jms Jul 23 '19 at 01:49
  • @Jms, not sure where the issue would be - the error seems to hint Python has trouble serializing the nested `f` method. I would check out the following question and its answer https://stackoverflow.com/questions/8804830/python-multiprocessing-picklingerror-cant-pickle-type-function – Ferrard Aug 06 '19 at 13:36
  • I noticed that the same code runs smoothly when running Python inside WSL, so it seems to be an issue with Python for Windows. – Jms Aug 07 '19 at 03:18
  • Follow-up: forking to another process creates a new process with its own PID. This means it doesn't have access to the original process's variables/data and can't change them. However, threading.Thread still gives the ReactorNotRestartable error. Is there a way to fix this so that it doesn't fork to another process? – Edgecase Sep 01 '19 at 03:23
  • Had a small issue regarding `AttributeError: Can't pickle local object 'run_spider.<locals>.f'`, but moving the function called `f` outside resolved it and I could run the code (a sketch of this follows after the comment thread) – Sagynbek Kenzhebaev Mar 20 '20 at 14:04
  • You saved me big time brother, thanks! @Ferrard – Eternal Apr 02 '20 at 09:09
  • Is there any way to make this work on AWS Lambda? I'm running into a problem with the reactor not closing (normal AWS Lambda behavior..) when the Lambda function is cached... wanted to use this but Lambda doesn't support the Queue function.. – Cohen Apr 04 '20 at 05:12
  • @Cohen My current (hacky) way of getting this to run on AWS lambda is to use `sys.exit()`. This is not ideal, since it exits the python process that the AWS lambda execution environment would like to reuse for subsequent requests. As a result, there's an impact on warm up times for subsequent requests, which might be an issue for fan-out lambda usage. However, it does work: `process = CrawlerProcess(); d = process.crawl(MySpider); d.addCallback(lambda _: reactor.stop()); reactor.run()` – Scott McAllister Jun 28 '20 at 03:14
  • I couldn't get this to work for a spider on AWS Lambda, nor could I get the other solutions to work... – Burak Kaymakci Sep 05 '20 at 10:38
  • This also helped me when using APScheduler (with multiple jobs), see: https://stackoverflow.com/questions/71632249/scrapy-reactoralreadyinstallederror-when-using-twistedscheduler – Melroy van den Berg Mar 29 '22 at 16:16
  • Don't forget to enable logging. – Melroy van den Berg Mar 31 '22 at 20:14
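
A minimal sketch of the fix mentioned in the comments above (my addition, not part of the original answer): on platforms where multiprocessing uses the spawn start method (e.g. Windows), the target function must be picklable, so the nested f is moved to module level. The names mirror the answer's code.

import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# module-level worker: picklable, so it also works with the 'spawn' start method
def crawl_worker(q, spider):
    try:
        runner = crawler.CrawlerRunner()
        deferred = runner.crawl(spider)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()
        q.put(None)
    except Exception as e:
        q.put(e)

def run_spider(spider):
    q = Queue()
    p = Process(target=crawl_worker, args=(q, spider))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result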
20

This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question
0) pip install crochet
1) add from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()

I had the same problem with this error, spent 4+ hours solving it, and read all the questions here about it. Finally I found that one, and I'm sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last 2 lines in my code:

# some more imports
from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)    # do some dynamic import of the selected spider
    spiderObj = scrapy_var.mySpider()          # get the mySpider object from the spider module
    crawler = CrawlerRunner(get_project_settings())   # from the Scrapy docs
    crawler.crawl(spiderObj)                          # from the Scrapy docs

This code lets me select which spider to run simply by passing its name to the run_spider function, and after scraping finishes, select another spider and run it again.
Hope this helps somebody, as it helped me :)
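
As a follow-up sketch (my addition, not from the original answer): crawler.crawl() returns a Deferred that the snippet above never waits on, so run_spider returns before scraping is finished. If you need to block until a crawl completes, crochet also provides a wait_for decorator; the timeout below is an arbitrary example value.

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

@wait_for(timeout=600.0)   # block the caller until the Deferred fires or the timeout hits
def run_spider_blocking(spider_cls):
    crawler = CrawlerRunner(get_project_settings())
    return crawler.crawl(spider_cls)   # crochet waits on the returned Deferred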

2

As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:

"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."

The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.

Honestly, if you think you need to restart the reactor, you're likely doing something wrong.

Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation.
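
For reference, a minimal sketch in the spirit of that documentation section (the spider names are placeholders): create one CrawlerProcess, schedule every crawl on it, and start the reactor exactly once.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# placeholder spiders - substitute your own spider classes
from myproject.spiders import SpiderA, SpiderB

process = CrawlerProcess(get_project_settings())
process.crawl(SpiderA)   # schedule the first crawl
process.crawl(SpiderB)   # schedule the second crawl
process.start()          # starts the reactor once; blocks until both crawls finish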

2

As some people pointed out already: You shouldn't need to restart the reactor.

Ideally if you want to chain your processes (crawl1 then crawl2 then crawl3) you simply add callbacks.

For example, I've been using this loop spider that follows this pattern:

1. Crawl A
2. Sleep N
3. goto 1

And this is how it looks in scrapy:

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here


def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d


def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()


if __name__ == '__main__':
    loop_crawl()

To explain the process a bit more: the crawl function schedules a crawl and adds two extra callbacks that are called when crawling is over: a blocking sleep and a recursive call to itself (which schedules another crawl).

$ python endless_crawl.py 
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
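
A side note (my addition, not part of the original answer): because time.sleep runs inside a reactor callback, it blocks the reactor thread for the whole pause. If you want the delay without blocking, Twisted's task.deferLater can schedule the next crawl instead, for example:

from twisted.internet import reactor, task

def crawl(runner, delay=5):
    d = runner.crawl(HttpbinSpider)
    # wait `delay` seconds without blocking the reactor, then schedule the next crawl
    d.addBoth(lambda _: task.deferLater(reactor, delay, crawl, runner, delay))
    return d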
  • I actually wrote an extensive blog on this here http://crawl.blog/scrapy-loop/ as well as provided feature-rich implementation https://gitlab.com/granitosaurus/scrapy-loop – Granitosaurus Mar 28 '19 at 08:54
  • Hi @Granitorsaurus. I'm going to have a go at implementing this but I've also been reading recently about memory leaks. If this process runs indefinitely through the recursive call will there be a memory leak? Also, the link to your blog is dead :( – Jossy Jan 24 '22 at 16:17
  • @Jossy yes, with Twisted there's always a risk of memory leaks, unfortunately. That being said, new versions of Scrapy and Twisted are much better! I've migrated my blog to https://scrapecrow.com but I haven't added the article for the scrapy loop yet; you can find the source code of the article here: https://gitlab.com/granitosaurus/crawl.blog/-/blob/master/content/scrapy-loop/contents.lr – Granitosaurus Jan 25 '22 at 04:07
1

The mistake is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

web_crawler() returns two results, and to produce them it tries to start the process twice, restarting the reactor, as pointed out by @Rejected.

Obtaining the results by running one single process, and storing both results in a tuple, is the way to go here:

def __call__(self):
    result1, result2 = test.web_crawler()
0

This solved my problem. Put the code below after reactor.run() or process.start():

import os, sys, time

time.sleep(0.5)
# re-exec the current script in a fresh Python process, so the next run gets a brand-new reactor
os.execl(sys.executable, sys.executable, *sys.argv)