I'm running Scrapy as an AWS Lambda function. Inside my function I need a timer that checks whether the crawl has been running for longer than 1 minute and, if so, runs some logic. Here is my code:

def handler():
    x = 60
    watchdog = Watchdog(x)
    try:
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
    except Watchdog:
        print('Timeout error: process takes longer than %s seconds.' % x)
        # some other logic here
    watchdog.stop()

I took the Watchdog timer class from this answer. The problem is that the code never reaches the except Watchdog block; instead, the exception is thrown outside of it:

Exception in thread Thread-1:
 Traceback (most recent call last):
   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
     self.run()
   File "/usr/lib/python3.6/threading.py", line 1182, in run
     self.function(*self.args, **self.kwargs)
   File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
     raise self
 functions.python.my_scrapy.index.Watchdog: 1

I need to catch the exception inside the function. How would I go about that? PS: I'm very new to Python.

rpanai
terreb

2 Answers

Alright, this question had me going a little crazy. Here is why that doesn't work:

The Watchdog object starts a timer in another thread (the traceback goes through threading.py), and the exception is raised in that thread, where nothing handles it. Your try/except only covers code running in the main thread, so it never sees the exception. Luckily, Twisted has some neat features.
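To make the threading issue concrete, here is a minimal standard-library sketch with no Scrapy or Twisted involved. The Watchdog class below is a hypothetical stand-in for the one in the question, and timer.join() stands in for the blocking reactor.run() call. It shows that an exception raised by a threading.Timer callback never reaches a try/except in the main thread, and that letting the callback set a flag (or run the timeout logic itself) is one way around that:

```python
import threading


class Watchdog(Exception):
    """Hypothetical stand-in for the question's Watchdog exception."""


def raise_in_timer_thread():
    # Runs in the Timer's own thread: the exception ends that thread
    # (its traceback goes to stderr) and never reaches the main thread.
    raise Watchdog("timer fired")


caught_in_main = False
timer = threading.Timer(0.1, raise_in_timer_thread)
timer.start()
try:
    timer.join()  # stand-in for the blocking work; waits for the timer thread to end
except Watchdog:
    caught_in_main = True  # never reached

# Alternative: the callback records the timeout instead of raising.
timed_out = threading.Event()
timer = threading.Timer(0.1, timed_out.set)
timer.start()
timed_out.wait(timeout=5)  # stand-in for long-running work in the main thread
timer.cancel()  # harmless if the timer has already fired

print("caught in main thread:", caught_in_main)
print("timeout flag set:", timed_out.is_set())
```

In a real handler the callback passed to the timer could run the timeout logic directly, as long as that logic is safe to run from a non-main thread.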

You can do it by running the reactor in another thread:

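import time
from threading import Thread

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

# MySpider1 and MySpider2 are the spider classes from the question
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
# Run the reactor in a separate thread so it doesn't block the script here.
# The False argument is installSignalHandlers: signal handlers can only be
# installed from the main thread.
Thread(target=reactor.run, args=(False,)).start()

time.sleep(60)  # Block the script here for 60 seconds

# Now check if it's still scraping
if reactor.running:
    pass  # do something
else:
    pass  # do something else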
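One thing to keep in mind is that time.sleep(60) always waits the full 60 seconds, even if the crawl finishes early. A possible variation, sketched here with only the standard library, is to keep a reference to the thread and call join() with a timeout, which returns as soon as the work ends. The names run_with_timeout and work are hypothetical; with the code above, work would be the call that runs the reactor:

```python
import threading
import time


def run_with_timeout(work, timeout):
    """Run work() in a background thread.

    Returns True if work() is still running after `timeout` seconds.
    """
    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    worker.join(timeout)  # returns when work() ends or after `timeout` seconds
    return worker.is_alive()  # True means the deadline passed first


# Short job, generous deadline: finishes in time.
print(run_with_timeout(lambda: time.sleep(0.05), timeout=1.0))
# Long job, short deadline: still running when the deadline passes.
print(run_with_timeout(lambda: time.sleep(1.0), timeout=0.05))
```

The thread is marked as a daemon here so that a job that overruns does not keep the interpreter alive; whether that is appropriate for a crawl that should be shut down cleanly is a separate design decision.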

I'm using Python 3.7.0.

Rafael Almeida

Twisted has scheduling primitives. For example, this program runs for about 60 seconds:

from twisted.internet import reactor
reactor.callLater(60, reactor.stop)
reactor.run()
Jean-Paul Calderone