
I have a Python script which watches a particular folder for changes. To be more precise, I'm waiting for a new JSON file containing hours. When this JSON file appears, a function is called to schedule a task with the schedule library.

The created task runs a spider at the scheduled hour. The issue comes from having multiple hours in my JSON file. The same spider is called multiple times (as many times as there are rows in the JSON file) and the program raises a ReactorNotRestartable error. The spider is called from the schedule.run_pending() line.

I'm pretty sure the issue comes from multiple calls to the same spider, because the program completes the first step of the scraping (the first hour with the first URL) but fails on the second one.

I don't know how to handle this reactor issue; can you give me some input?

Watchdog module to monitor the directory

try:
    # get hours from the JSON file
    hours = get_all_starting_hours('../data/output/result_debut.json')
    logger.info(hours)
    # get URLs from the JSON file
    urls = get_all_urls('../data/output/result_debut.json')
    logger.info(urls)
    for hour, url in zip(hours, urls):
        # schedule the scraping task for each URL at the given hour
        logger.info(hour)
        logger.info(url)
        schedule.every().day.at(str(hour)).do(job_that_executes_once, url, process_settings=None)
    while True:
        logger.info('in the while loop')
        # run scheduled tasks
        schedule.run_pending()
        time.sleep(1)
except Exception as e:
    logger.debug(str(e))
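(The folder watching itself isn't shown above. A minimal sketch of how it might be wired up with watchdog, assuming a hypothetical schedule_from_json() that wraps the try block above:)

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewJsonHandler(FileSystemEventHandler):
    def on_created(self, event):
        # react only when the expected JSON file appears
        if event.src_path.endswith('result_debut.json'):
            schedule_from_json(event.src_path)  # hypothetical: runs the scheduling code above

observer = Observer()
observer.schedule(NewJsonHandler(), '../data/output', recursive=False)
observer.start()  # the while True / run_pending loop keeps the main thread alive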

Schedule

def job_that_executes_once(url, process_settings):
    logger.add("../data/logs/schedule_{time}.log")
    logger.info('job has been launched')
    # run the spider
    run_MySpider(url)
    return schedule.CancelJob  # unschedule so the job runs only once

Spider

class MySpider(scrapy.Spider):
    name = "enchere_detail"

    logger.add('../data/logs/Spider_{time}.log')

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.start_urls = [kwargs.get('url_start')]
        logger.info(self.start_urls)

    def parse(self, response):
        logger.info('start of parse')
        yield {
            'fin_vente': response.css('span.stats-heure-fin::text').get(),
            'url': response.url,
        }

# process is a single module-level CrawlerProcess, reused for every job
process = CrawlerProcess()

def run_MySpider(url):
    process.crawl(MySpider, url_start=url)
    process.start()  # a Twisted reactor can only be started once per process

The error is:

line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Thank you

Jordan
  • I'm not sure if this is a duplicate question, or just highly relevant, but check out https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy – Scott Mermelstein Jul 02 '19 at 14:56
  • I can't use `process.start(stop_after_crawl=False)` because it blocks the main process, so run_pending can't launch the spider's call. The addCallback option is more interesting, but I don't know how to organize my code to get the wanted result... – Jordan Jul 02 '19 at 15:14
  • Possible duplicate of [ReactorNotRestartable error in while loop with scrapy](https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy) – Gallaecio Jul 05 '19 at 13:08
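(One way around the blocking problem raised in the comments above, not from the original post, just a sketch: launch each crawl in a separate child process, so every scheduled job gets a fresh Twisted reactor and process.start() is never called twice in the same process. Assumes MySpider as defined above.)

import multiprocessing

from scrapy.crawler import CrawlerProcess

def _crawl(url):
    # runs in a child process with its own, freshly started reactor
    process = CrawlerProcess()
    process.crawl(MySpider, url_start=url)
    process.start()  # blocks only the child process

def run_MySpider(url):
    p = multiprocessing.Process(target=_crawl, args=(url,))
    p.start()
    p.join()  # wait for the crawl to finish; drop this to keep the scheduler responsive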

1 Answer


Unfortunately, I didn't find a way to do it with Scrapy. I wrote the code with BeautifulSoup instead, which lets me execute the same piece of code multiple times.
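(The answer doesn't include the code; a minimal sketch of what a requests/BeautifulSoup equivalent might look like, reusing the same CSS selector as the spider above:)

import requests
from bs4 import BeautifulSoup

def scrape(url):
    # a plain function can be scheduled and re-run as many times as needed
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    end_time = soup.select_one('span.stats-heure-fin')
    return {
        'fin_vente': end_time.get_text(strip=True) if end_time else None,
        'url': url,
    }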

Jordan