
I am new to the world of distributed Scrapy crawls, but I found out about scrapy-redis and have been using it. I am using it on a Raspberry Pi to scrape a large number of URLs that I push to Redis. What I have been doing is creating multiple SSH sessions into the Pi, in each of which I run scrapy crawl myspider so the spider "waits". I then start another SSH session and do redis-cli lpush with my links. The crawlers then run, although I'm not sure how concurrently they are actually running.
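For reference, the barebones setup I'm describing is roughly the sketch below. The spider name, the Redis key, and the parse logic are placeholders, not my exact code; the RedisSpider base class is the one scrapy-redis provides.

```python
# myspider.py -- minimal scrapy-redis spider sketch (names are placeholders)
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    # The spider blocks ("waits") until URLs are pushed to this Redis list,
    # e.g. with: redis-cli lpush myspider:start_urls <url>
    redis_key = "myspider:start_urls"

    def parse(self, response):
        # Placeholder extraction: just record the page title.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

The project settings point Scrapy at Redis using the settings from the scrapy-redis README; the Redis URL below assumes Redis runs locally on the Pi.

```python
# settings.py -- scrapy-redis scheduler and dupefilter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                # keep the queue between runs
REDIS_URL = "redis://localhost:6379"    # assumed: Redis on the Pi itself
```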

I'm hoping this is clear; if not, please let me know and I can clarify. I'm really just looking for a "next step" after implementing this barebones version of scrapy-redis.

edit: I based my starting point on this answer: Extract text from 200k domains with scrapy. The answerer said he spun up 64 spiders using scrapy-redis.

Justin

1 Answer


What is the point of creating multiple SSH sessions? Concurrency? If that is the reason, Scrapy itself can handle all of the URLs at once with whatever concurrency you configure, and it will give accurate feedback on how the crawl went.

In that case you will only need one Scrapy spider.
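As a sketch, the concurrency knobs are standard Scrapy settings in settings.py; the values below are placeholders you would tune for the Pi's resources, not recommendations:

```python
# settings.py -- standard Scrapy concurrency settings (values are placeholders)
CONCURRENT_REQUESTS = 32              # total concurrent requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # per-domain cap, to avoid hammering one site
DOWNLOAD_DELAY = 0.25                 # optional politeness delay between requests
REACTOR_THREADPOOL_MAXSIZE = 20       # helps DNS resolution when crawling many domains
```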

On the other hand, if the idea is to utilise multiple instances anyway, I suggest you take a look at Frontera (https://github.com/scrapinghub/frontera).

MartiONE
  • I'm aware of the concurrency capabilities; I guess I'm just not sure what value scrapy-redis has then. Do you need multiple machines (IP addresses) in order for it to be valuable? I will look into Frontera. – Justin Aug 01 '20 at 14:22