I am trying to crawl a list of sites with Scrapy. I tried to put the list of website URLs as the start_urls, but then I found I couldn't afford that much memory. Is there any way to set Scrapy to crawl only one or two sites at a time?

David Thompson
2 Answers
You can try using `CONCURRENT_REQUESTS = 1` so that you don't get overloaded with data:
http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests
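For example, a minimal sketch of setting this per spider via `custom_settings` in current Scrapy versions (the spider name and URL below are placeholders):

```python
import scrapy

class ThrottledSpider(scrapy.Spider):
    name = "throttled"
    # Only one request in flight at a time; this could also go in settings.py.
    custom_settings = {"CONCURRENT_REQUESTS": 1}

    def start_requests(self):
        yield scrapy.Request("http://www.example.com", callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```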

Mirage
You can define a start_requests method which iterates through requests to your URLs. This avoids the overhead of holding all your start URLs in memory at once and is the simplest approach to the problem you described.
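A rough sketch of such a generator, assuming the URLs live one per line in a file (the filename `urls.txt` is a placeholder), so the full list never has to sit in a `start_urls` attribute:

```python
import scrapy

class LazyStartSpider(scrapy.Spider):
    name = "lazy_start"

    def start_requests(self):
        # Read one URL per line and yield requests lazily instead of
        # building a big start_urls list up front.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url}
```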
If this is still too many URLs for Scrapy to hold in memory during the crawl, you can enable persistence support.
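The persistence support meant here is Scrapy's job directory. A minimal sketch in settings.py (the directory name is arbitrary); with JOBDIR set, the pending request queue is kept on disk rather than in memory, and the crawl can be paused and resumed:

```python
# settings.py -- enable crawl persistence; pending requests are serialised
# to disk under this directory instead of being held in memory.
# Equivalent to passing -s JOBDIR=crawls/myspider-1 on the command line.
JOBDIR = "crawls/myspider-1"
```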
If you really want to feed Scrapy only a few URLs at a time, you can do so by registering for the spider_idle signal and, in your callback function, adding the next few URLs and raising DontCloseSpider; a sketch follows.
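This sketch uses current Scrapy APIs (the question predates some of them), and the batch size, filename, and spider name are invented for illustration. The spider keeps an iterator over its URLs; every time the engine goes idle it schedules the next small batch and raises DontCloseSpider so the crawl stays open:

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

BATCH_SIZE = 2  # how many sites to feed to the engine per idle event

class BatchedSpider(scrapy.Spider):
    name = "batched"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Get notified whenever the engine runs out of scheduled requests.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One URL per line; the filename is a placeholder.
        self.url_iter = open("pending_urls.txt")

    def start_requests(self):
        # Seed the crawl with the first batch; later batches arrive via on_idle.
        yield from self.next_batch()

    def next_batch(self):
        count = 0
        while count < BATCH_SIZE:
            line = next(self.url_iter, None)
            if line is None:
                return
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)
                count += 1

    def on_idle(self, spider):
        requests = list(self.next_batch())
        if not requests:
            return  # nothing left, let the spider close normally
        for req in requests:
            # Scrapy >= 2.10 takes only the request here; older versions
            # expect engine.crawl(req, spider).
            self.crawler.engine.crawl(req)
        raise DontCloseSpider

    def parse(self, response):
        yield {"url": response.url}
```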

Shane Evans
- Thanks. I tried start_requests with an iterator inside it, but it didn't help. From the log, I noticed the spider still crawled multiple domains before it went deep. I understand the persistence support is for storing what the spider crawled so that next time it can resume from there; that may not be what I am looking for. If I misunderstood, please correct me. Can you share more about the third approach with the spider_idle signal? I have limited experience with Scrapy. – David Thompson Jan 13 '13 at 20:44
- I thought the reason not to use start_urls was memory usage? In that case start_requests means you do not need to put all requests in memory. Persistence support avoids holding outstanding (yet to be made) requests in memory. If you want to limit concurrency and control crawl order, that is also possible, but I'm not really understanding why you need to do this and what you need to achieve. – Shane Evans Jan 13 '13 at 22:12
- The only problem I need to solve is reducing the memory consumption with a list of sites. I used start_requests with an iterator inside it, but the spider still crawled URLs from multiple domains. – David Thompson Jan 14 '13 at 03:51
- And how was the memory usage? You seem to be assuming that crawling URLs from multiple domains causes memory issues, but it should be fine. – Shane Evans Jan 14 '13 at 12:12
- The memory increases proportionally with the number of websites I include in start_urls, so I assume that crawling URLs from multiple domains causes memory issues. – David Thompson Jan 15 '13 at 02:32
- I am blocked from posting questions, so I cannot post this as a new question and am asking for help here: `allowed_domains = ["fake1.com","fake2.com"]` and `start_urls = ["http://www.fake1.com","http://www.fake2.com"]`. I would like Scrapy to run the start_urls one by one, allowing only the allowed_domains entry at the same position/index. For example, when Scrapy loads **www.fake1.com** it should recursively load all the internal links, i.e. only URLs that contain **fake1.com**. – Vinodh Velumayil Jul 13 '15 at 12:29