I have a couple of spiders in my Scrapy project. Each of them collects data from various websites and stores it in the database (separately). After each spider finishes, I need to run code that processes its data (let's call it the data processing subroutine). This takes a variable amount of time (up to an hour), depending on the spider/data.
My goal is to have a script that runs these spiders simultaneously and triggers the data processing subroutine for each spider as soon as its crawling is finished, without interfering with the spiders that are still running or with the data processing subroutines of other finished spiders. In other words, I want to do it all in the shortest amount of time.
I know I can run spiders simultaneously this way:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
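i.e. something along these lines, adapted from that page (`Spider1`/`Spider2` stand in for my real spiders):

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class Spider1(scrapy.Spider):
    name = "spider1"
    # start_urls, parse(), etc. omitted

class Spider2(scrapy.Spider):
    name = "spider2"
    # start_urls, parse(), etc. omitted

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks here until all crawls are finished
```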
I also know/think I could use the spider_closed signal inside each of the spiders to trigger the data processing subroutine.
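Something like this is what I have in mind (`process_spider_data` is just a placeholder for my actual subroutine):

```python
import scrapy
from scrapy import signals

def process_spider_data(spider_name):
    # placeholder for the data processing subroutine (can take up to an hour)
    ...

class Spider1(scrapy.Spider):
    name = "spider1"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # called once this spider has finished crawling
        process_spider_data(spider.name)
```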
My questions are:
- Will this work as I imagine? Won't the data processing subroutines compete for resources, since they all run in the same process?
- Is there a way to use actual multiprocessing and run each spider in a separate process (a rough sketch of what I mean is below)? Or is there some other, better way to do this?
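To illustrate the second question, this is roughly what I imagine, though I don't know whether it is a sensible approach (`Spider1`, `Spider2` and `process_spider_data` are the same placeholders as in the sketches above):

```python
import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_and_process(spider_cls):
    # Each child process gets its own CrawlerProcess (and Twisted reactor),
    # crawls one spider, then runs the data processing subroutine for it.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until this spider is finished
    process_spider_data(spider_cls.name)

if __name__ == "__main__":
    procs = [
        multiprocessing.Process(target=crawl_and_process, args=(cls,))
        for cls in (Spider1, Spider2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```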
Thank you.