
I am new to the world of distributed Scrapy crawls, but I found out about scrapy-redis and have been using it. I am using it on a Raspberry Pi to scrape a large number of URLs that I push to Redis. What I have been doing is creating multiple SSH sessions into the Pi, in each of which I run scrapy crawl myspider so the spider "waits". I then start another SSH session and do redis-cli lpush with my links. The crawlers then run, although I'm not sure how concurrently they are actually running.
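For reference, the barebones setup I'm describing is roughly the sketch below. The spider name, the Redis key, and the parse logic are placeholders, not my exact code; the RedisSpider base class is the one scrapy-redis provides.

```python
# myspider.py -- minimal scrapy-redis spider sketch (names are placeholders)
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    # The spider blocks ("waits") until URLs are pushed to this Redis list,
    # e.g. with: redis-cli lpush myspider:start_urls <url>
    redis_key = "myspider:start_urls"

    def parse(self, response):
        # Placeholder extraction: just record the page title.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

The project settings point Scrapy at Redis using the settings from the scrapy-redis README; the Redis URL below assumes Redis runs locally on the Pi.

```python
# settings.py -- scrapy-redis scheduler and dupefilter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                # keep the queue between runs
REDIS_URL = "redis://localhost:6379"    # assumed: Redis on the Pi itself
```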

I'm hoping this is clear; if not, please let me know and I can clarify. I'm really just looking for a "next step" after implementing this barebones version of scrapy-redis.

edit: I based my starting point on this answer: Extract text from 200k domains with scrapy. The answerer said he spun up 64 spiders using scrapy-redis.

Justin

1 Answer


What is the point of creating multiple SSH sessions? Concurrency? If that is the reason, Scrapy itself can handle all of the URLs at once with whatever concurrency you configure, and it will give accurate feedback on how the crawl went.

In that case you will only need one Scrapy spider.
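As a sketch, the concurrency knobs are standard Scrapy settings in settings.py; the values below are placeholders you would tune for the Pi's resources, not recommendations:

```python
# settings.py -- standard Scrapy concurrency settings (values are placeholders)
CONCURRENT_REQUESTS = 32              # total concurrent requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # per-domain cap, to avoid hammering one site
DOWNLOAD_DELAY = 0.25                 # optional politeness delay between requests
REACTOR_THREADPOOL_MAXSIZE = 20       # helps DNS resolution when crawling many domains
```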

On the other hand, if the idea is to utilise multiple instances anyway, I suggest you take a look at Frontera (https://github.com/scrapinghub/frontera).

MartiONE
  • I'm aware of the concurrency capabilities; I guess I'm just not sure what value scrapy-redis has then. Do you need multiple machines (IP addresses) in order for it to be valuable? I will look into Frontera. – Justin Aug 01 '20 at 14:22