Python scrapy ReactorNotRestartable substitute

Question

I have been trying to make an app in Python using Scrapy that has the following functionality:

A REST API (which I built with Flask) listens for crawl/scrape requests and returns the response after crawling. (The crawl is short enough that the connection can be kept alive until it completes.)

I am able to do this using the following code:

items = []
def add_item(item):
    items.append(item)

# set up crawler
crawler = Crawler(SpiderClass,settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)

# Stop the reactor when the spider closes; without this, the code hangs at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) #@UndefinedVariable 
crawler.crawl(requestParams=requestParams)
# start crawling 
reactor.run() #@UndefinedVariable
return str(items)

The problem is that once the reactor has been stopped (which seems necessary, since otherwise the code stays blocked in reactor.run()), I cannot serve any further requests. After the first request completes, the next one fails with the following error:

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable

This is expected, since a Twisted reactor cannot be restarted once it has been stopped.

So my questions are:

1) How can I support subsequent crawl requests?

2) Is there any way to move on to the line after reactor.run() without stopping the reactor?

Does this [answer](http://stackoverflow.com/a/18924451/1117028) help? — Tiger-222, Sep 15 '16 at 08:51
See answers at http://stackoverflow.com/questions/32724537/building-a-restful-flask-api-for-scrapy and http://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy?noredirect=1&lq=1. — Mikhail Korobov, Sep 15 '16 at 16:59
@MikhailKorobov Thanks for sharing the links, [using subprocess](http://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy?noredirect=1&lq=1#answer-37270442) works for me, — sagar, Sep 15 '16 at 19:48

score 1 · Answer 1 · answered Sep 15 '16 at 16:46

I recommend using a queue system like RQ (for simplicity; there are a few others).
You could have a crawl function:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders import MySpider

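def runCrawler(url, keys, mode, outside, uniqueid):
    # Build a runner from the Scrapy project settings
    runner = CrawlerRunner(get_project_settings())

    # Schedule the crawl; keyword arguments are passed to the spider
    d = runner.crawl( MySpider, url=url, param1=value1, ... )

    # Stop the reactor once the crawl finishes, whether it succeeded or failed
    d.addBoth(lambda _: reactor.stop())
    # Blocks until reactor.stop() is called
    reactor.run()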

Then in your main code, use the Rq queue in order to collect crawler executions:

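# other imports
import redis
from rq import Queue

pool = redis.ConnectionPool(host=REDIS_HOST, port=REDIS_PORT, db=your_redis_db_number)
redis_conn = redis.Redis(connection_pool=pool)

# Queue named 'parse'; the worker must listen on the same name
q = Queue('parse', connection=redis_conn)

# urlSet is a list of http:// or https:// URLs
# timeout is the maximum job run time in seconds (newer RQ releases may expect job_timeout instead)
for url in urlSet:
    job = q.enqueue(runCrawler, url, param1, ... , timeout=600 )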

Do not forget to start an RQ worker process listening on the same queue name (here, parse). For example, run this in a terminal session:

rq worker parse

score 1 · Accepted Answer · answered Sep 16 '16 at 15:45

Here is a simple solution to your problem:

from flask import Flask
import threading
import subprocess
import sys
app = Flask(__name__) 

class myThread (threading.Thread):
    def __init__(self,target):
        threading.Thread.__init__(self)
        self.target = target
    def run(self):
        start_crawl()

def start_crawl():
    pid = subprocess.Popen([sys.executable, "start_request.py"])
    return


@app.route("/crawler/start")
def start_req():
    print(":request")
    # Launch the crawler from a background thread so the request returns immediately
    threadObj = myThread("run_crawler")
    threadObj.start()
    return "Your crawler is in running state"


if __name__ == "__main__":
    app.run(port=5000)

The Flask code above assumes that you can start your crawler from the shell/command line by running the start_request.py file.
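If the endpoint has to return the scraped items rather than a status message, as the question asks, one option is to wait for the child process and read what it prints. The following is only a sketch: it assumes start_request.py writes its items to stdout as JSON (that contract is not part of the answer above), crawl_in_subprocess is a hypothetical helper name, and the `-c` child is a stand-in so the example runs without Scrapy.

```python
import json
import subprocess
import sys

# Stand-in for start_request.py: a child process that prints its items as JSON.
# In a real project the command would be [sys.executable, "start_request.py"].
CHILD_CODE = 'import json; print(json.dumps([{"title": "example"}]))'


def crawl_in_subprocess(cmd, timeout=600):
    """Run one crawl in a fresh process and return the items it printed.

    Every call starts a new interpreter, so every crawl gets its own
    Twisted reactor and the web process never has to restart one.
    """
    output = subprocess.check_output(cmd, timeout=timeout)
    return json.loads(output.decode("utf-8"))


items = crawl_in_subprocess([sys.executable, "-c", CHILD_CODE])
print(items)
```

Inside a Flask view, the return value of crawl_in_subprocess could be serialized and sent back as the response. The view blocks until the crawl finishes, which matches the short crawls described in the question.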

The Flask code launches a new thread for each incoming request, and each thread starts its own crawler process, so crawler instances can run in parallel, one per hit. Just keep the number of threads under control, for example by checking threading.activeCount().
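One way to enforce such a limit is a semaphore instead of counting threads by hand. This is a minimal sketch, not part of the original answer: MAX_CRAWLS and try_start_crawl are hypothetical names, and the `-c` command stands in for start_request.py so the example runs without Scrapy.

```python
import subprocess
import sys
import threading

MAX_CRAWLS = 2  # hypothetical limit on simultaneous crawler processes
_slots = threading.BoundedSemaphore(MAX_CRAWLS)


def _run_and_release(cmd):
    """Run the crawler process to completion, then free its slot."""
    try:
        subprocess.call(cmd)
    finally:
        _slots.release()


def try_start_crawl(cmd):
    """Start a crawl in a background thread if a slot is free.

    Returns the started thread, or None when MAX_CRAWLS crawls are
    already running (the caller could answer with HTTP 503 in that case).
    """
    if not _slots.acquire(blocking=False):
        return None
    worker = threading.Thread(target=_run_and_release, args=(cmd,))
    worker.start()
    return worker


# Stand-in for [sys.executable, "start_request.py"]: a child that sleeps briefly.
cmd = [sys.executable, "-c", "import time; time.sleep(0.5)"]
started = [try_start_crawl(cmd) for _ in range(3)]
for worker in started:
    if worker is not None:
        worker.join()
```

With the limit set to 2, the third call returns None because both slots are still held; a slot is released only after its crawler process exits.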
