
I have a spider that crawls a site, and I want to run it every 10 minutes. I put it in a Python `schedule` job and ran it. After the first run I got

ReactorNotRestartable

I tried this solution and got this error:

AttributeError: Can't pickle local object 'run_spider.<locals>.f'

Edit: I tried the approach from [How to schedule Scrapy crawl execution programmatically](https://stackoverflow.com/questions/47552507/how-to-schedule-scrapy-crawl-execution-programmatically). The program runs without errors and the crawl function runs every 30 seconds, but the spider doesn't actually start and I get no data.

from multiprocessing import Process, Queue

from scrapy import crawler
from twisted.internet import reactor

# DivarSpider is my spider class, imported from my Scrapy project.

def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(DivarSpider)
            #deferred.addBoth(lambda _: reactor.stop())
            #reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Soroush Karimi
  • Possible duplicate of [How to schedule Scrapy crawl execution programmatically](https://stackoverflow.com/questions/47552507/how-to-schedule-scrapy-crawl-execution-programmatically) – notorious.no Aug 01 '18 at 01:31
  • Why not just set up a cronjob for it? simple – Umair Ayub Aug 01 '18 at 04:11
  • Is it possible to make the `AttributeError: Can't pickle local object 'run_spider.<locals>.f'` go away in the solution above? I would really like to use it. – Jms Jul 23 '19 at 03:15
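For the cron route suggested in the comments, a minimal crontab entry might look like the following sketch. The project path, script name, and log file are placeholders to adapt to your setup:

```
*/10 * * * * cd /path/to/your/project && python3 run_crawl.py >> crawl.log 2>&1
```

Because cron starts a fresh process for each run, the Twisted reactor starts clean every time and `ReactorNotRestartable` never arises.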

2 Answers


The multiprocessing solution is a gross hack that works around a lack of understanding of how Scrapy and reactor management work. You can get rid of it, and everything becomes much simpler.

from twisted.internet.task import LoopingCall
from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from yourlib import YourSpider

configure_logging()
runner = CrawlerRunner()
task = LoopingCall(runner.crawl, YourSpider)
task.start(60 * 10)  # seconds: run the crawl every 10 minutes
reactor.run()
Jean-Paul Calderone
    From personal experience, I've found that `LoopingCall` isn't well suited for "scraping every X times" problems because it might lead to the same crawls occurring multiple times, which might mess up results. It's better to schedule the next crawl using callbacks from `runner.crawl()`. – notorious.no Aug 01 '18 at 01:36
    LoopingCall definitely will not schedule two calls concurrently - as long as you return a `Deferred` defining when your call has finished. – Jean-Paul Calderone Aug 01 '18 at 11:41

The easiest way I know is to use a separate script that calls the script containing your Twisted reactor, like this:

import subprocess

cmd = ['python3', 'auto_crawl.py']
subprocess.Popen(cmd).wait()

To run your CrawlerRunner every 10 minutes, you could use a loop or crontab on this script.
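The loop variant could be sketched like this. The `run_every` helper and the `max_runs` parameter are hypothetical additions (the bound makes the loop testable); the script name `auto_crawl.py` is taken from the answer above:

```python
import subprocess
import time

def run_every(cmd, interval_seconds, max_runs=None):
    """Run cmd in a fresh process, wait for it to exit, then sleep
    until the next run. A fresh process means a fresh Twisted reactor
    each time, so ReactorNotRestartable never comes up."""
    runs = 0
    while max_runs is None or runs < max_runs:
        subprocess.Popen(cmd).wait()
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs

# Example usage (script name assumed; runs forever):
# run_every(['python3', 'auto_crawl.py'], 60 * 10)
```

Note that the 10 minutes are counted from the end of one crawl to the start of the next, so a slow crawl stretches the cycle rather than overlapping runs.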

Lucas Wieloch