I'm trying to run multiple Scrapy spiders with the CrawlerProcess class:
for market_id in markets_data["market_id"]:
    starting_url = markets_data[markets_data['market_id'] == market_id].iloc[0]['specific_url']
    cond_url = markets_data[markets_data['market_id'] == market_id].iloc[0]['cond_url']
    crawl_limit = markets_data[markets_data['market_id'] == market_id].iloc[0]['crawl_limit']
    market_language = markets_data[markets_data['market_id'] == market_id].iloc[0]['language']
    print("Crawl limit : ", crawl_limit)
    process = CrawlerProcess({
        'CLOSESPIDER_PAGECOUNT': crawl_limit
    })
    process.crawl(SpiderKncs, market_id, starting_url, cond_url, market_language, markets_data, keywords_table)
    process.start()  # the script will block here until the crawling is finished
The issue here is the exception twisted.internet.error.ReactorNotRestartable. I know I can't have process.start() inside the loop, but if process = CrawlerProcess({}) is created outside, above the loop, and process.start() is called after the loop (so the loop only schedules crawls with process.crawl()), then I can't have my custom 'CLOSESPIDER_PAGECOUNT': crawl_limit setting per market.
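Roughly, that first variant looks like this (the page count value here is just a placeholder, which is exactly the problem: there is only one place to put it for all markets):

from scrapy.crawler import CrawlerProcess

# Variant 1 (sketch): one shared CrawlerProcess, crawls scheduled in the loop,
# a single start() after the loop. No ReactorNotRestartable, but the
# CLOSESPIDER_PAGECOUNT is fixed once for every market.
process = CrawlerProcess({
    'CLOSESPIDER_PAGECOUNT': 100  # placeholder: which market's crawl_limit goes here?
})

for market_id in markets_data["market_id"]:
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    process.crawl(SpiderKncs, market_id, row['specific_url'], row['cond_url'],
                  row['language'], markets_data, keywords_table)

process.start()  # blocks until all scheduled crawls are finished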
If process = CrawlerProcess({'CLOSESPIDER_PAGECOUNT': crawl_limit}) and process.crawl() are inside the loop and only process.start() is outside, after the for, I get a lot of twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.
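Sketched, that second variant is this: every market gets its own settings, but everything still runs under the single reactor started at the end.

# Variant 2 (sketch): a new CrawlerProcess per market inside the loop,
# a single start() after it. Each market has its own page-count setting,
# but all the scheduled crawls end up running together once the reactor
# starts, and this is where the ConnectionLost failures show up.
for market_id in markets_data["market_id"]:
    row = markets_data[markets_data['market_id'] == market_id].iloc[0]
    process = CrawlerProcess({'CLOSESPIDER_PAGECOUNT': row['crawl_limit']})
    process.crawl(SpiderKncs, market_id, row['specific_url'], row['cond_url'],
                  row['language'], markets_data, keywords_table)

process.start()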
Note that in my spider I'm scraping links recursively, staying on the same domain, with SpiderKncs.rules = Rule(LxmlLinkExtractor(allow=rf"({current_url}).*", unique=True), callback='parse_item', follow=True), and without a depth limit; I just need a page count limit per domain (current_url).
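In full, the rule setup is roughly this (current_url is the market's start domain):

from scrapy.spiders import Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# Follow every link on the same domain, no depth limit; I rely on
# CLOSESPIDER_PAGECOUNT to bound the crawl per domain.
SpiderKncs.rules = (
    Rule(LxmlLinkExtractor(allow=rf"({current_url}).*", unique=True),
         callback='parse_item', follow=True),
)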
What's the right way to do that? Multiple spiders, each with different settings (CLOSESPIDER_PAGECOUNT here)? I've been struggling with this for days.
EDIT: I also have an issue when the 'CONCURRENT_REQUESTS' setting is more than one: CLOSESPIDER_PAGECOUNT doesn't stop the crawl where expected (e.g. 14 pages scraped with a page count limit of 2), but I read that this is because the extra pages were already in the queue. No idea how to handle the page count limit in that case.
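Concretely, with something like this I'd expect the spider to close after about 2 pages, but I get around 14:

# Page-count overshoot example: requests already scheduled keep being
# processed after the limit is reached when concurrency is above 1.
process = CrawlerProcess({
    'CLOSESPIDER_PAGECOUNT': 2,
    'CONCURRENT_REQUESTS': 16,  # Scrapy's default; anything above 1 shows the issue
})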
EDIT2: I tried CrawlerRunner like this:
for market_id in markets_data["market_id"]:
    start = time.time()
    starting_url = markets_data[markets_data['market_id'] == market_id].iloc[0]['specific_url']
    cond_url = markets_data[markets_data['market_id'] == market_id].iloc[0]['cond_url']
    crawl_limit = markets_data[markets_data['market_id'] == market_id].iloc[0]['crawl_limit']
    market_language = markets_data[markets_data['market_id'] == market_id].iloc[0]['language']
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner({
        'CLOSESPIDER_PAGECOUNT': crawl_limit
    })
    d = runner.crawl(SpiderKncs, market_id, starting_url, cond_url, market_language, markets_data,
                     keywords_table)
    # process = CrawlerProcess({
    #     'CLOSESPIDER_PAGECOUNT': crawl_limit
    # })
    end = time.time()
    print("All Data Sent Time Taken: {:.6f}s".format(end - start))
    # process.start()  # the script will block here until the crawling is finished

d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
I tried other variations too, but the spider opens, I get all the middleware logs, and then it just sits idle, doing nothing.
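For reference, the sequential pattern from the Scrapy docs that I was trying to adapt looks roughly like this, with the runner re-created per market so each one gets its own page-count limit (a sketch of the intent, not something I've got working):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

@defer.inlineCallbacks
def crawl_all_markets():
    for market_id in markets_data["market_id"]:
        row = markets_data[markets_data['market_id'] == market_id].iloc[0]
        runner = CrawlerRunner({'CLOSESPIDER_PAGECOUNT': row['crawl_limit']})
        # wait for each crawl to finish before starting the next one
        yield runner.crawl(SpiderKncs, market_id, row['specific_url'],
                           row['cond_url'], row['language'],
                           markets_data, keywords_table)
    reactor.stop()

crawl_all_markets()
reactor.run()  # blocks here until the last crawl has finished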