
Before you link me to other answers related to this, note that I've read them and am still a bit confused. Alrighty, here we go.

So I am creating a webapp in Django. I am importing the newest scrapy library to crawl a website. I am not using celery (I know very little about it, but saw it in other topics related to this).

One of the URLs of our website, /crawl/, is meant to start the crawler. It's the only URL on our site that requires scrapy. Here is the view function that is called when the URL is visited:

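from django.shortcuts import render
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

# ReviewSpider is the project's own spider class (its import is not shown here)

def crawl(request):
  configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
  runner = CrawlerRunner()

  d = runner.crawl(ReviewSpider)
  d.addBoth(lambda _: reactor.stop())
  reactor.run() # the script will block here until the crawling is finished

  return render(request, 'index.html')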

You'll notice that this is an adaptation of the scrapy tutorial on their website. The first time this URL is visited after the server starts, everything works as intended. On the second and every later visit, a ReactorNotRestartable exception is thrown. I understand that this exception happens when a reactor that has already been stopped is told to start again, which is not possible.

Looking at the sample code, I would assume the line "runner = CrawlerRunner()" would return a ~new~ reactor for use each time this URL is visited. But perhaps my understanding of twisted reactors is incomplete.

How would I go about getting and running a NEW reactor each time this url is visited?

Thank you so much

1 Answer


Generally speaking, you can't have a new reactor. There's one global one. This is clearly a mistake and maybe it will be corrected in the future but that's the current state of affairs.

You might be able to use Crochet to manage a single reactor running (for the lifetime of your whole process - not repeatedly starting and stopping) in a separate thread.

Consider the example from the Crochet docs:

#!/usr/bin/python
"""
Do a DNS lookup using Twisted's APIs.
"""
from __future__ import print_function

# The Twisted code we'll be using:
from twisted.names import client

from crochet import setup, wait_for
setup()


# Crochet layer, wrapping Twisted's DNS library in a blocking call.
@wait_for(timeout=5.0)
def gethostbyname(name):
    """Lookup the IP of a given hostname.

    Unlike socket.gethostbyname() which can take an arbitrary amount of time
    to finish, this function will raise crochet.TimeoutError if more than 5
    seconds elapse without an answer being received.
    """
    d = client.lookupAddress(name)
    d.addCallback(lambda result: result[0][0].payload.dottedQuad())
    return d


if __name__ == '__main__':
    # Application code using the public API - notice it works in a normal
    # blocking manner, with no event loop visible:
    import sys
    name = sys.argv[1]
    ip = gethostbyname(name)
    print(name, "->", ip)

This gives you a blocking gethostbyname function that's implemented using Twisted APIs. The implementation uses twisted.names.client which just relies on being able to import the global reactor.

Note there is no reactor.run or reactor.stop call - just the Crochet setup call.

Jean-Paul Calderone
  • But how would I do this in the case of a django project? How do I make the reactor start on website start and end on website shut down? And how do I reference it later each time the crawler needs to run? – Chris Shael Peabody Jun 27 '17 at 19:35
  • To answer your first question, that's what Crochet does. :) The answer to the second part could take multiple forms - perhaps create an object that has a reference to the reactor or perhaps just rely on the global reactor import always giving you the same reactor. – Jean-Paul Calderone Jun 28 '17 at 12:34
  • It seems that this was deceptively simple. Hopefully it really was this simple. I merely commented out these two lines: `d.addBoth(lambda _: reactor.stop())` and `reactor.run()`, and added the Crochet import and setup() call to the top of the file. It seems to work smoothly. I don't have a great understanding of reactors and such, so hopefully there's not something I'm missing. Thanks though! – Chris Shael Peabody Jul 07 '17 at 18:19
  • Can you show how you did it? Perhaps in an answer. It would be very helpful, thanks – Debendra Jan 05 '20 at 15:39