
I'm trying to run multiple Scrapy spiders with the CrawlerProcess class:

from scrapy.crawler import CrawlerProcess

for market_id in markets_data["market_id"]:
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    starting_url = row['specific_url']
    cond_url = row['cond_url']
    crawl_limit = row['crawl_limit']
    market_language = row['language']
    print("Crawl limit:", crawl_limit)

    process = CrawlerProcess({
        'CLOSESPIDER_PAGECOUNT': crawl_limit
    })
    process.crawl(SpiderKncs, market_id, starting_url, cond_url, market_language,
                  markets_data, keywords_table)
    process.start()  # the script will block here until the crawling is finished

The issue here is the exception twisted.internet.error.ReactorNotRestartable. I know I can't have process.start() inside the loop, but if I move process = CrawlerProcess({}) above the loop and process.start() after it, keeping only process.crawl() inside the for, then I can't set my custom 'CLOSESPIDER_PAGECOUNT': crawl_limit per spider.
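For clarity, this is roughly the layout I mean, as a sketch: one CrawlerProcess above the loop, one process.crawl() per market, and a single process.start() after the loop. The settings dict passed to that single CrawlerProcess is global, which is exactly where I would need a per-market crawl_limit.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    # a single global value here would apply to every spider,
    # so there is no place for a per-market crawl_limit
    # 'CLOSESPIDER_PAGECOUNT': ...
})

for market_id in markets_data["market_id"]:
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    process.crawl(SpiderKncs, market_id, row['specific_url'], row['cond_url'],
                  row['language'], markets_data, keywords_table)

process.start()  # single reactor start, after all crawls are scheduled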

If process = CrawlerProcess({'CLOSESPIDER_PAGECOUNT': crawl_limit}) and process.crawl() are inside the loop and only process.start() is outside, after the for, I get a lot of errors like twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.
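The second arrangement, again as a sketch of what I described above (per-market settings, but only one start() after the loop):

for market_id in markets_data["market_id"]:
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    process = CrawlerProcess({'CLOSESPIDER_PAGECOUNT': row['crawl_limit']})
    process.crawl(SpiderKncs, market_id, row['specific_url'], row['cond_url'],
                  row['language'], markets_data, keywords_table)

process.start()  # this is where the ConnectionLost failures show up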

Note that in my spider I scrape links recursively as long as they stay on the same domain, with SpiderKncs.rules = Rule(LxmlLinkExtractor(allow=rf"({current_url}).*", unique=True), callback='parse_item', follow=True) and no depth limit; I just need a page count limit per domain (current_url).
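For context, this is roughly how the rule is attached to the spider (a sketch: the example domain, the spider name and the empty parse_item body are placeholders, not my real code):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class SpiderKncs(CrawlSpider):
    name = 'spider_kncs'  # placeholder name for the sketch

    def parse_item(self, response):
        # extract data from each followed page (details omitted)
        ...

current_url = "https://example.com"  # placeholder: the domain of the market being crawled

# follow every link that stays on current_url, with no depth limit;
# rules is expected to be an iterable of Rule objects
SpiderKncs.rules = (
    Rule(LxmlLinkExtractor(allow=rf"({current_url}).*", unique=True),
         callback='parse_item', follow=True),
)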

What's the right way to do this: multiple spiders, each with different settings (CLOSESPIDER_PAGECOUNT here)? I've been struggling with this for days.

EDIT: I also have an issue when the 'CONCURRENT_REQUESTS' setting is greater than one: CLOSESPIDER_PAGECOUNT doesn't work exactly (e.g. 14 pages scraped with a page count limit of 2), but I read that this is because the extra pages were already in the queue. I have no idea how to handle the page count limit.
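To illustrate what I mean by handling the limit, here is a sketch of counting pages in the spider itself and raising CloseSpider; this is hypothetical, not code I currently have, and pages_seen plus a crawl_limit attribute on the spider are placeholders:

from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider

class SpiderKncs(CrawlSpider):
    # hypothetical: count parsed pages manually instead of relying only on
    # CLOSESPIDER_PAGECOUNT, which still lets already-queued requests finish
    pages_seen = 0

    def parse_item(self, response):
        self.pages_seen += 1
        if self.pages_seen > self.crawl_limit:  # crawl_limit would be passed to the spider
            raise CloseSpider('page count limit reached')
        # ... normal item extraction would go here ...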

EDIT2: I tried CrawlerRunner like this:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
import time

for market_id in markets_data["market_id"]:
    start = time.time()
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    starting_url = row['specific_url']
    cond_url = row['cond_url']
    crawl_limit = row['crawl_limit']
    market_language = row['language']
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner({
        'CLOSESPIDER_PAGECOUNT': crawl_limit
    })
    d = runner.crawl(SpiderKncs, market_id, starting_url, cond_url, market_language,
                     markets_data, keywords_table)
    # process = CrawlerProcess({
    #     'CLOSESPIDER_PAGECOUNT': crawl_limit
    # })

    end = time.time()
    print("All Data Sent Time Taken: {:.6f}s".format(end - start))
    # process.start()  # the script would block here until the crawling is finished

d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

I tried other variations as well, but the spider opens, I get all the middleware logs, and then it just sits there idle, doing nothing.
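For reference, the sequential-chaining pattern from the Scrapy docs looks roughly like this when adapted to one CrawlerRunner per market; the per-market runner is my own adaptation, so treat this as a sketch rather than a verified solution:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

@defer.inlineCallbacks
def crawl_sequentially():
    # one CrawlerRunner per market, so each crawl can get its own CLOSESPIDER_PAGECOUNT
    for market_id in markets_data["market_id"]:
        row = markets_data[markets_data['market_id'] == market_id].iloc[0]
        runner = CrawlerRunner({'CLOSESPIDER_PAGECOUNT': row['crawl_limit']})
        yield runner.crawl(SpiderKncs, market_id, row['specific_url'], row['cond_url'],
                           row['language'], markets_data, keywords_table)
    reactor.stop()

crawl_sequentially()
reactor.run()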

  • please check this question https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy – Leonardo Maffei Feb 25 '21 at 01:34
  • that one might clarify it further https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy – Leonardo Maffei Feb 25 '21 at 01:39
  • But in short: you CAN NOT start the reactor more than once PER PROCESS. Also, the reactor will only end when the process itself finishes or you stop it manually. – Leonardo Maffei Feb 25 '21 at 01:41
  • I already came across both of those in my research. About the first link: I didn't find what I'm looking for there. It doesn't seem that CrawlerRunner lets me change the settings between spiders, or at least it isn't mentioned in the docs or in that post, and I have no idea how to create different CrawlerProcess or CrawlerRunner instances with it. The second link is the same. I know I can't restart the reactor, but I need something equivalent; there should be a way to run multiple spiders with different settings in a framework this big... – TurboMachine Feb 25 '21 at 14:39
  • Did you see the alternatives suggested? You could spawn a process and, inside that process, create another CrawlerProcess (see the sketch after these comments). – Leonardo Maffei Feb 25 '21 at 15:12
  • I tried that and I get the ReactorNotRestartable error. – TurboMachine Feb 25 '21 at 16:47
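A sketch of the spawn-a-process suggestion from the comments, assuming the multiprocessing module and the same spider arguments as above; run_one_crawl is a hypothetical helper, and markets_data, keywords_table and SpiderKncs are assumed to be available at module level:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

def run_one_crawl(market_id, starting_url, cond_url, market_language, crawl_limit):
    # runs inside a child process, which gets its own fresh Twisted reactor
    process = CrawlerProcess({'CLOSESPIDER_PAGECOUNT': crawl_limit})
    process.crawl(SpiderKncs, market_id, starting_url, cond_url, market_language,
                  markets_data, keywords_table)
    process.start()

for market_id in markets_data["market_id"]:
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    p = Process(target=run_one_crawl,
                args=(market_id, row['specific_url'], row['cond_url'],
                      row['language'], row['crawl_limit']))
    p.start()
    p.join()  # wait for this market's crawl before starting the next one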
