I have a Python script that watches a particular folder for changes. To be more precise, I'm waiting for a new JSON file containing hours. When this JSON file appears, a function is called to schedule a task with the schedule library.
The created task runs a spider at the scheduled hour.
The issue is that my JSON file contains multiple hours. The same spider is called multiple times (as many times as there are rows in the JSON file), and the program raises a ReactorNotRestartable error. The spider is called from the schedule.run_pending() line.
I'm fairly sure the issue comes from calling the same spider multiple times, because the program completes the first step of the scraping (the first hour with the first URL) but fails on the second one.
I don't know how to handle this reactor issue; can you give me some pointers?
Watchdog module to monitor the folder
try:
    # get hours from the JSON file
    hours = get_all_starting_hours('../data/output/result_debut.json')
    logger.info(hours)
    # get urls from the JSON file
    urls = get_all_urls('../data/output/result_debut.json')
    logger.info(urls)
    for hour, url in zip(hours, urls):
        # schedule the task for each URL at the given hour
        logger.info(hour)
        logger.info(url)
        # schedule scraping task
        schedule.every().day.at(str(hour)).do(job_that_executes_once, url, process_settings=None)
    while True:
        logger.info('inside the while loop')
        # run scheduled tasks
        schedule.run_pending()
        time.sleep(1)
except Exception as e:
    logger.debug(str(e))
Schedule
def job_that_executes_once(url, process_settings):
    logger.add("../data/logs/schedule_{time}.log")
    logger.info('job was launched')
    # run spider
    run_MySpider(url)
    return schedule.CancelJob
Spider
class MySpider(scrapy.Spider):
    name = "enchere_detail"
    logger.add('../data/logs/Spider_{time}.log')

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.start_urls = [kwargs.get('url_start')]
        logger.info(self.start_urls)

    def parse(self, response):
        logger.info('parse started')
        yield {
            'fin_vente': response.css('span.stats-heure-fin::text').get(),
            'url': response.url
        }

def run_MySpider(url):
    process.crawl(MySpider, url_start=url)
    process.start()
The error is:

    line 754, in startRunning
        raise error.ReactorNotRestartable()
    twisted.internet.error.ReactorNotRestartable
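From what I've read, one workaround would be to start each crawl in its own process, so every scheduled job gets a fresh reactor. I haven't verified this against my spider yet; here is only a minimal sketch of the pattern, with a plain function standing in for `process.crawl` / `process.start` (the `crawl` helper and its one-start-per-process check are illustrative, not Scrapy code):

```python
from multiprocessing import Process

# Stand-in for scrapy's CrawlerProcess.start(): the Twisted reactor can be
# started only once per process; a second start raises ReactorNotRestartable.
_reactor_started = False

def crawl(url):
    """Pretend to run one crawl; fails if the 'reactor' was already started."""
    global _reactor_started
    if _reactor_started:
        raise RuntimeError("ReactorNotRestartable")
    _reactor_started = True
    print(f"crawling {url}")

def run_spider_in_subprocess(url):
    # Each crawl runs in its own process and therefore gets a fresh
    # "reactor", so scheduling the job several times no longer fails.
    p = Process(target=crawl, args=(url,))
    p.start()
    p.join()
    return p.exitcode  # 0 on success

if __name__ == "__main__":
    # Two scheduled hours -> two crawls, each in a fresh process.
    run_spider_in_subprocess("http://example.com/a")
    run_spider_in_subprocess("http://example.com/b")
```

The idea would be for `job_that_executes_once` to call something like `run_spider_in_subprocess(url)` instead of starting the reactor in the main process, but I'm not sure this is the idiomatic fix.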
Thank you