
I am stuck while initiating multiple instances of the same spider. I want to run it as 1 URL per spider instance. I have to process 50k URLs, and for this I need to initiate a separate instance for each. In my main spider script, I have set a closespider timeout (CLOSESPIDER_TIMEOUT) of 7 minutes, to make sure that I am not crawling for too long. Please see the code below:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse

for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('ww'):
        domain = domain.split(".",1)[1]

    process = CrawlerProcess(get_project_settings())
    process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
    process.start()

It runs completely for the 1st URL, but after that, when the 2nd URL is passed, it gives the error below:

raise error.ReactorNotRestartable()
ReactorNotRestartable

Please suggest what I should do to make it run for multiple instances of the same spider. Also, I am thinking of initiating multiple instances of Scrapy at the same time using threads. Would that be a fine approach?

  • Any update on this matter? – UriCS Sep 12 '16 at 19:49
  • Possible duplicate of [ReactorNotRestartable error in while loop with scrapy](https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy) – Gallaecio Jan 22 '19 at 11:53
  • How did you solve it? I face a similar issue. I loop over all my urls with CrawlerRunner, but the results are not as expected. Some urls are crawled, some others are only partially crawled, and some others are not crawled at all. When I run each url with only one spider separately, the results are as expected. It drives me crazy! – A.Papa Oct 21 '21 at 16:39

3 Answers


How about this

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse

# Create one CrawlerProcess, schedule every crawl on it,
# and start the reactor only once at the end.
process = CrawlerProcess(get_project_settings())

for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('ww'):
        domain = domain.split(".", 1)[1]
    process.crawl('textextractor', start_url=start_url, allowed_domains=domain)

process.start()
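
With this layout, `process.crawl()` only schedules each spider; `process.start()` then starts the Twisted reactor a single time and runs all the scheduled crawls in the same process, which is why `ReactorNotRestartable` no longer appears.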
furas
  • furas I tested your solution, but it does not give the right results. It does initiate multiple instances without raising the `ReactorNotRestartable` error, but it only crawls the last passed url completely; for the other urls it starts crawling but does not crawl more than 1 url before the spider finishes. I have checked those urls separately and they do return a lot of crawled data. Plus, as I have mentioned, I have to do this for 50k urls, so doing it this way means I'll be starting the crawling process for all 50k urls at once? Does that seem a fine approach? – user3721618 Nov 15 '15 at 09:22

Is there a specific reason you want to start 50k instances of the spider? Twisted's reactor, which Scrapy runs on, can only be started once per process (unless you kill the entire process and restart it).

Secondly, "1 url for 1 spider instance" will cause a huge overhead in memory. You should instead consider passing all the urls to the same instance.
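
As a rough sketch of that approach (assuming the URLs live in a plain text file; the urls_file argument and the parse() body below are illustrative, not the asker's actual textextractor code), one spider instance can read all of its start URLs up front:

import scrapy

class TextExtractorSpider(scrapy.Spider):
    name = 'textextractor'

    def __init__(self, urls_file=None, *args, **kwargs):
        super(TextExtractorSpider, self).__init__(*args, **kwargs)
        # One instance handles every URL: read them from a file, one per line,
        # instead of spawning a separate process per URL.
        with open(urls_file) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Placeholder extraction; the real spider's parsing logic goes here.
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}

It could then be run once with something like scrapy crawl textextractor -a urls_file=all_urls.txt.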

mannmann2

In your case this is not actually needed. Inside a Scrapy spider, every request you yield is handled asynchronously, so there is no need to create multiple instances.

The way to speed up a spider is to increase concurrency.

And the way to feed it 50k urls is to pass them as spider arguments.
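
As a hedged illustration of both points (the setting values below are examples only, and urls_file is an assumed spider argument, not something defined by the original spider):

# settings.py: raise concurrency instead of starting more spider instances
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 16
REACTOR_THREADPOOL_MAXSIZE = 20

The 50k urls can then be handed to a single spider run via a spider argument, e.g. scrapy crawl textextractor -a urls_file=all_urls.txt, and read inside the spider's __init__.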

Jadian