My situation:
I have about 40 million pages to crawl, and all of the site's anti-spider measures have already been dealt with.
Right now Scrapy crawls only 60–100 pages per minute on a single computer. (The website can handle far more load, and neither my bandwidth nor my CPU is the bottleneck.)
How can I increase the speed of crawling?
My start_urls contains only one URL, and every next URL is built from the previous response.
I think this may be the reason for my problem.
Some of my settings:
RETRY_ENABLED = 1
RETRY_TIMES = 2
DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100
And if I can get all 40 million page URLs ahead of time, what should I do to speed up the crawl? (I am quite sure I can get them.)
Should I put all of them into start_urls and set CONCURRENT_REQUESTS to 30 or higher?
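If I did have the full list, here is a minimal sketch of what I have in mind (urls.txt is a hypothetical file with one URL per line, and the spider name is a placeholder). Since start_requests is a generator, the 40 million URLs would be streamed into the scheduler lazily instead of being held in one huge start_urls list:

import scrapy

class SeededSpider(scrapy.Spider):
    # Hypothetical spider: reads pre-collected page URLs from a file.
    name = 'seeded'

    def start_requests(self):
        # Scrapy consumes this generator lazily, so memory stays flat
        # and the CONCURRENT_REQUESTS limit can actually be saturated.
        with open('urls.txt') as f:  # assumed file, one URL per line
            for line in f:
                yield scrapy.Request(line.strip(), callback=self.parse)

    def parse(self, response):
        pass  # extract items here, as in the partial code below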
One method I have thought of is to put all 40 million page URLs into a Redis database and have 10 or more workers pull URLs from it and crawl at the same time. How can I set up workers that fetch URLs from Redis concurrently? All of this should run on one computer.
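A minimal sketch of the Redis idea (assuming the redis-py client, a Redis server on localhost, and a hypothetical list key qiye:start_urls; the scrapy-redis project packages this same pattern):

import redis
import scrapy

class RedisSeededSpider(scrapy.Spider):
    # Hypothetical spider: pops pre-collected URLs from a Redis list.
    name = 'redis_seeded'

    def start_requests(self):
        r = redis.Redis(host='localhost', port=6379)  # assumed local Redis
        while True:
            raw = r.lpop('qiye:start_urls')  # hypothetical key name
            if raw is None:
                break  # queue is drained
            yield scrapy.Request(raw.decode('utf-8'), callback=self.parse)

    def parse(self, response):
        pass  # extract items here, as in the partial code below

(One thing I noticed while researching: Scrapy runs on Twisted's single-threaded event loop, so concurrency comes from CONCURRENT_REQUESTS rather than threads; "10 or more threads" would really mean running several spider processes that share the same Redis list.)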
How can I increase Scrapy's crawling speed on a single computer?
Partial code
import json

from scrapy import FormRequest

# These are methods of my spider class; get_next_coordinate, get_form,
# get_proxy, QiyeItem and QiyeItemloader are helpers from my project
# (not shown here).

def start_requests(self):
    url = 'https://www.xxxx.com/map_searchByLocation'
    # Seed the crawl with a single coordinate probe.
    longitude, latitude = get_next_coordinate(self.points, self.start_longitude, self.start_latitude, self.radius)
    data = get_form(longitude, latitude, self.radius)
    proxy = 'http://' + get_proxy()
    yield FormRequest(url, method='POST', formdata=data, callback=self.parse, dont_filter=True,
                      meta={'proxy': proxy, 'download_timeout': 3,
                            'longitude': data['longitude'], 'latitude': data['latitude'], 'data': data})

def parse(self, response):
    info_list = json.loads(response.text)
    if info_list['listCount']:
        for item in info_list['list']:
            item_loader = QiyeItemloader(item=QiyeItem())
            item_loader.add_value('hash', item['KeyNo'])
            item_loader.add_value('name', item['Name'])
            item_loader.add_value('longitude', response.meta['longitude'])
            item_loader.add_value('latitude', response.meta['latitude'])
            qiye_item = item_loader.load_item()
            yield qiye_item
    # The next request is built from this response, so the crawl advances
    # one coordinate at a time.
    longitude, latitude = get_next_coordinate(self.points, response.meta['longitude'], response.meta['latitude'], self.radius)
    next_data = get_form(longitude, latitude, self.radius)
    yield FormRequest(response.url, method='POST', formdata=next_data, callback=self.parse, dont_filter=True,
                      meta={'proxy': response.meta['proxy'], 'download_timeout': 3,
                            'longitude': next_data['longitude'], 'latitude': next_data['latitude'], 'data': next_data})