
I have a spider that crawls a site, and I want to run it every 10 minutes. I put it in a Python `schedule` job and ran it. After the first run I got

ReactorNotRestartable

I tried this solution and got this error:

AttributeError: Can't pickle local object 'run_spider.<locals>.f'

Edit: I tried the approach from [How to schedule Scrapy crawl execution programmatically](https://stackoverflow.com/questions/47552507/how-to-schedule-scrapy-crawl-execution-programmatically). The program runs without errors and the crawl function runs every 30 seconds, but the spider doesn't actually start and I get no data.

from multiprocessing import Process, Queue

from scrapy import crawler
from twisted.internet import reactor

# DivarSpider is my spider class, imported from my Scrapy project.

def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(DivarSpider)
            #deferred.addBoth(lambda _: reactor.stop())
            #reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Soroush Karimi
  • Possible duplicate of [How to schedule Scrapy crawl execution programmatically](https://stackoverflow.com/questions/47552507/how-to-schedule-scrapy-crawl-execution-programmatically) – notorious.no Aug 01 '18 at 01:31
  • Why not just set up a cronjob for it? simple – Umair Ayub Aug 01 '18 at 04:11
  • Is it possible to make the `AttributeError: Can't pickle local object 'run_spider.<locals>.f'` go away in the solution above? I would really like to use it. – Jms Jul 23 '19 at 03:15
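For the cron route suggested in the comments, a minimal crontab entry might look like the following sketch. The project path, script name, and log file are placeholders to adapt to your setup:

```
*/10 * * * * cd /path/to/your/project && python3 run_crawl.py >> crawl.log 2>&1
```

Because cron starts a fresh process for each run, the Twisted reactor starts clean every time and `ReactorNotRestartable` never arises.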

2 Answers


The multiprocessing solution is a gross hack that works around a lack of understanding of how Scrapy and reactor management work. You can get rid of it, and everything becomes much simpler.

from twisted.internet.task import LoopingCall
from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from yourlib import YourSpider

configure_logging()
runner = CrawlerRunner()
task = LoopingCall(runner.crawl, YourSpider)
task.start(60 * 10)  # seconds: run the crawl every 10 minutes
reactor.run()
Jean-Paul Calderone
    From personal experience, I've found that `LoopingCall` isn't well suited for "scraping every X times" problems because it might lead to the same crawls occurring multiple times, which might mess up results. It's better to schedule the next crawl using callbacks from `runner.crawl()`. – notorious.no Aug 01 '18 at 01:36
    LoopingCall definitely will not schedule two calls concurrently - as long as you return a `Deferred` defining when your call has finished. – Jean-Paul Calderone Aug 01 '18 at 11:41

The easiest way I know is to use a separate script that calls the script containing your Twisted reactor, like this:

import subprocess

cmd = ['python3', 'auto_crawl.py']
subprocess.Popen(cmd).wait()

To run your CrawlerRunner every 10 minutes, you could use a loop or crontab on this script.
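The loop variant could be sketched like this. The `run_every` helper and the `max_runs` parameter are hypothetical additions (the bound makes the loop testable); the script name `auto_crawl.py` is taken from the answer above:

```python
import subprocess
import time

def run_every(cmd, interval_seconds, max_runs=None):
    """Run cmd in a fresh process, wait for it to exit, then sleep
    until the next run. A fresh process means a fresh Twisted reactor
    each time, so ReactorNotRestartable never comes up."""
    runs = 0
    while max_runs is None or runs < max_runs:
        subprocess.Popen(cmd).wait()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs

# Example usage (script name assumed; runs forever):
# run_every(['python3', 'auto_crawl.py'], 60 * 10)
```

Note that the 10 minutes are counted from the end of one crawl to the start of the next, so a slow crawl stretches the cycle rather than overlapping runs.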

Lucas Wieloch