I am trying to build a crawler (using scrapy) that launches spiders from a main.py with multiprocessing.
The first spider (cat_1) is launched without multiprocessing, using scrapy.crawler.CrawlerProcess:
crawler_settings = Settings()
crawler_settings.setmodule(default_settings)
runner = CrawlerProcess(settings=crawler_settings)
runner.crawl(cat_1)
runner.start(stop_after_crawl=True)
It works fine; all the data is handled by the FEED export.
The next spider needs the first spider's results and uses multiprocessing: after loading the first spider's results, I build a list of URLs and pass it to my function process_cat_2(). This function creates processes, and each of them launches the spider cat_2:
from multiprocessing import Process

def launch_crawler_cat_2(crawler, url):
    cat_name = url[0]
    cat_url = url[1]
    runner.crawl(crawler, cat_name, cat_url)

def process_cat_2(url_list):
    nb_spiders = len(url_list)
    list_process = [None] * nb_spiders
    while url_list:
        for i in range(nb_spiders):
            if not (list_process[i] and list_process[i].is_alive()):
                list_process[i] = Process(target=launch_crawler_cat_2, args=(cat_2, url_list.pop(0)))
                list_process[i].start()
                # break

    # Wait for all processes to end
    for process in list_process:
        if process:
            # process.start()
            process.join()
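As a sanity check that the process-management loop itself works (independently of Scrapy), here is the same slot-reuse pattern with a dummy worker in place of the spider; dummy_worker, process_urls and the example URLs are placeholder names of mine:

```python
from multiprocessing import Process, Queue

def dummy_worker(name, url, results):
    # Stand-in for launch_crawler_cat_2: just report what it was given.
    results.put((name, url))

def process_urls(url_list):
    results = Queue()
    nb = len(url_list)
    slots = [None] * nb
    while url_list:
        for i in range(nb):
            # Only reuse a slot whose process is finished (or never started),
            # and only pop while there are URLs left.
            if url_list and not (slots[i] and slots[i].is_alive()):
                name, url = url_list.pop(0)
                slots[i] = Process(target=dummy_worker, args=(name, url, results))
                slots[i].start()
    # Wait for all processes to end, then drain the queue.
    for p in slots:
        if p:
            p.join()
    return sorted(results.get() for _ in range(nb))

if __name__ == "__main__":
    out = process_urls([("cats", "https://example.com/cats"),
                        ("dogs", "https://example.com/dogs")])
    print(out)  # [('cats', 'https://example.com/cats'), ('dogs', 'https://example.com/dogs')]
```

This confirms the loop dispatches every URL to a child process, so the issue is on the Scrapy side rather than in the process bookkeeping.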
The problem is that runner.crawl(crawler, cat_name, cat_url) (inside launch_crawler_cat_2) does not crawl anything:
2021-10-07 17:20:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
And I do not know how to reuse the existing twisted.internet.reactor so as to avoid this error:
twisted.internet.error.ReactorNotRestartable
which is raised when using:
def launch_crawler_cat_2(crawler, url):
    cat_name = url[0]
    cat_url = url[1]
    runner.crawl(crawler, cat_name, cat_url)
    runner.start()
How can I launch a new spider with the existing reactor object?
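One direction I am considering (not yet validated) is to not reuse the parent's reactor at all: build a fresh CrawlerProcess inside each child process instead. A child started with the "spawn" start method gets a clean interpreter, so its reactor has never been started; a forked child, by contrast, would inherit the parent's already-stopped reactor and could still raise ReactorNotRestartable. The sketch below assumes the same default_settings module and cat_2 spider as above:

```python
import multiprocessing as mp

def launch_crawler_cat_2(crawler, url):
    # Runs entirely in the child process: build a fresh CrawlerProcess
    # there, then start it and block until the crawl finishes.
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings
    import default_settings  # assumption: same settings module used for cat_1

    cat_name, cat_url = url
    crawler_settings = Settings()
    crawler_settings.setmodule(default_settings)
    runner = CrawlerProcess(settings=crawler_settings)
    runner.crawl(crawler, cat_name, cat_url)
    runner.start(stop_after_crawl=True)

def process_cat_2(crawler, url_list):
    # "spawn" gives each child a clean interpreter, so no reactor
    # state is inherited from the parent.
    ctx = mp.get_context("spawn")
    processes = [ctx.Process(target=launch_crawler_cat_2, args=(crawler, url))
                 for url in url_list]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

With "spawn", the arguments passed to each Process must be picklable, so the spider class has to be importable from a module rather than defined inline.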