I am stuck while initiating multiple instances of the same spider. I want to run them as one URL per spider instance. I have to process 50k URLs, and for this I need to initiate a separate instance for each. In my main spider script, I have set CLOSESPIDER_TIMEOUT to 7 minutes, to make sure that no single crawl runs for too long. Please see the code below:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse

for start_url in all_urls:
    # Derive the allowed domain from the URL, stripping a leading 'www.'
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('www.'):
        domain = domain.split('.', 1)[1]
    # One CrawlerProcess per URL -- this is where it breaks on the 2nd pass
    process = CrawlerProcess(get_project_settings())
    # allowed_domains is passed as a list so the spider sees a single domain
    process.crawl('textextractor', start_url=start_url, allowed_domains=[domain])
    process.start()
It runs completely for the 1st URL, but when the 2nd URL is passed it fails with the error below:
raise error.ReactorNotRestartable()
ReactorNotRestartable
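For what it's worth, the only pattern I know of that avoids this error is the one from the Scrapy docs for running several spiders in the same process: schedule every crawl first and call start() exactly once. A rough sketch, reusing the loop from above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse

process = CrawlerProcess(get_project_settings())
for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('www.'):
        domain = domain.split('.', 1)[1]
    # crawl() only schedules the spider; nothing runs until start()
    process.crawl('textextractor', start_url=start_url, allowed_domains=[domain])
process.start()  # one reactor run for all scheduled crawls

But this would schedule all 50k crawls inside a single process and run them concurrently, which is not the one-isolated-instance-per-URL behaviour I am after.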
Please suggest what I should do to make it run for multiple instances of the same spider. Also, I am thinking of initiating multiple instances of Scrapy at a time in parallel; a rough sketch of what I have in mind follows. Would it be a fine approach?
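I originally considered threads, but as far as I understand the Twisted reactor is global to a process and cannot be restarted or shared across threads, so in this sketch I use multiprocessing instead, giving each crawl its own OS process and a fresh reactor (the crawl_one wrapper is just my naming; all_urls and the 'textextractor' spider are from my project):

import urlparse
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_one(start_url, domain):
    # Runs in a child process, so it gets its own fresh reactor
    process = CrawlerProcess(get_project_settings())
    process.crawl('textextractor', start_url=start_url, allowed_domains=[domain])
    process.start()

if __name__ == '__main__':
    for start_url in all_urls:
        domain = urlparse.urlparse(start_url).netloc
        if domain.startswith('www.'):
            domain = domain.split('.', 1)[1]
        p = Process(target=crawl_one, args=(start_url, domain))
        p.start()
        p.join()  # sequential for now; a Pool could run several at once

Would this be a reasonable way to handle 50k URLs, or is there a better-supported approach?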