I have a couple of spiders in my Scrapy project. Each of them collects data from various websites and stores it in the database (separately). After each spider finishes, I need to run code that processes its data (let's call it the data processing subroutine). This takes a variable amount of time (up to an hour), depending on the spider/data.
My goal is to have a script that runs these spiders simultaneously and triggers the data processing subroutine for each spider as soon as its crawling is finished, without interfering with the spiders that are still running or with the data processing subroutines of other finished spiders. In other words, I want to do it all in the shortest amount of time.
I know I can run spiders simultaneously this way:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
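i.e. something along these lines, adapted from that page (`Spider1`/`Spider2` stand in for my real spiders):

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class Spider1(scrapy.Spider):
    name = "spider1"
    # start_urls, parse(), etc. omitted

class Spider2(scrapy.Spider):
    name = "spider2"
    # start_urls, parse(), etc. omitted

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks here until all crawls are finished
```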
I also know/think I could use the spider_closed signal inside each of the spiders to trigger the data processing subroutine.
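Something like this is what I have in mind (`process_spider_data` is just a placeholder for my actual subroutine):

```python
import scrapy
from scrapy import signals

def process_spider_data(spider_name):
    # placeholder for the data processing subroutine (can take up to an hour)
    ...

class Spider1(scrapy.Spider):
    name = "spider1"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # called once this spider has finished crawling
        process_spider_data(spider.name)
```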
My questions are:
- Will this work as I imagine? Won't the data processing subroutines compete for resources, since they all run in the same process?
- Is there a way to use actual multiprocessing and run each spider in a separate process (a rough sketch of what I mean is below)? Or is there some other, better way to do this?
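To illustrate the second question, this is roughly what I imagine, though I don't know whether it is a sensible approach (`Spider1`, `Spider2` and `process_spider_data` are the same placeholders as in the sketches above):

```python
import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_and_process(spider_cls):
    # Each child process gets its own CrawlerProcess (and Twisted reactor),
    # crawls one spider, then runs the data processing subroutine for it.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until this spider is finished
    process_spider_data(spider_cls.name)

if __name__ == "__main__":
    procs = [
        multiprocessing.Process(target=crawl_and_process, args=(cls,))
        for cls in (Spider1, Spider2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```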
Thank you.