
My situation:

I have about 40 million pages to crawl, and all the anti-spider measures have already been dealt with.

Right now Scrapy only crawls 60-100 pages per minute on a single computer. (The website has plenty of capacity, and my bandwidth and CPU are nowhere near their limits.)

How can I increase the speed of crawling?

My start_urls contains only one URL; every subsequent URL is created from the previous response. I think this may be the cause of my problem.

Some of my settings:

RETRY_ENABLED = 1 
RETRY_TIMES = 2
DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 0 
CONCURRENT_REQUESTS = 100 
CONCURRENT_REQUESTS_PER_DOMAIN = 100 
CONCURRENT_REQUESTS_PER_IP = 100

And if I can get all 40 million page URLs up front, what should I do to increase crawling speed? (I am sure I can obtain them.)

Should I put all the URLs into start_urls and set CONCURRENT_REQUESTS to 30 or higher?

One method I have thought of is to put all 40 million page URLs into a Redis database and create 10 or more threads that fetch URLs and crawl at the same time.

So how can I set up threads that fetch URLs from Redis concurrently? And all of this should run on a single computer.
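
A minimal sketch of what I mean with Redis, assuming the 40 million URLs have already been pushed to a Redis list named urls_to_crawl (the list name, the local Redis connection, and the plain GET requests are placeholders, not my real code). With Scrapy there should be no need to create threads myself: the start_requests generator is consumed lazily, and the downloader keeps up to CONCURRENT_REQUESTS requests in flight.

import redis
import scrapy


class RedisUrlSpider(scrapy.Spider):
    # Hypothetical spider; it only illustrates feeding requests from Redis.
    name = 'redis_url_spider'

    def start_requests(self):
        # Assumes the URLs were pushed beforehand, e.g. LPUSH urls_to_crawl <url>.
        server = redis.Redis(host='localhost', port=6379, db=0)
        while True:
            raw = server.rpop('urls_to_crawl')
            if raw is None:  # queue exhausted
                break
            # Scrapy pulls from this generator only when it has spare capacity,
            # so the pending queue stays small even with 40 million URLs.
            yield scrapy.Request(raw.decode('utf-8'), callback=self.parse)

    def parse(self, response):
        # Extraction logic goes here.
        pass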

How can I increase Scrapy crawling speed on a single computer?

Partial code

import json

from scrapy import FormRequest

# The methods below belong to the spider class; helpers such as get_next_coordinate,
# get_form, get_proxy, QiyeItemloader and QiyeItem are defined elsewhere.

def start_requests(self):
    url = 'https://www.xxxx.com/map_searchByLocation'
    # Build the first coordinate and POST form; every later request is
    # chained from the previous response in parse().
    longitude, latitude = get_next_coordinate(self.points, self.start_longitude, self.start_latitude, self.radius)
    data = get_form(longitude, latitude, self.radius)
    proxy = 'http://' + get_proxy()
    yield FormRequest(url, method='POST', formdata=data, callback=self.parse, dont_filter=True,
                      meta={'proxy': proxy, 'download_timeout': 3,
                            'longitude': data['longitude'], 'latitude': data['latitude'], 'data': data})

def parse(self, response):
    info_list = json.loads(response.text)
    if info_list['listCount']:
        for item in info_list['list']:
            item_loader = QiyeItemloader(item=QiyeItem())
            item_loader.add_value('hash', item['KeyNo'])
            item_loader.add_value('name', item['Name'])
            item_loader.add_value('longitude', response.meta['longitude'])
            item_loader.add_value('latitude', response.meta['latitude'])
            qiye_item = item_loader.load_item()
            yield qiye_item
    # Only one follow-up request is yielded per response, so at most one
    # request is ever in flight, no matter how high CONCURRENT_REQUESTS is.
    longitude, latitude = get_next_coordinate(self.points, response.meta['longitude'], response.meta['latitude'], self.radius)
    next_data = get_form(longitude, latitude, self.radius)
    yield FormRequest(response.url, method='POST', formdata=next_data, callback=self.parse, dont_filter=True,
                      meta={'proxy': response.meta['proxy'], 'download_timeout': 3,
                            'longitude': next_data['longitude'], 'latitude': next_data['latitude'], 'data': next_data})
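
If get_next_coordinate really only needs the previous coordinates, as the call above suggests (it never looks at the response body), the whole coordinate sequence could be generated up front so the requests no longer form a one-at-a-time chain. A rough sketch of that restructuring, reusing the helpers above; the generate_coordinates helper and its end-of-grid check are placeholders:

def generate_coordinates(self):
    # Placeholder helper: walk the whole grid once, without waiting for responses.
    longitude, latitude = self.start_longitude, self.start_latitude
    while True:
        longitude, latitude = get_next_coordinate(self.points, longitude, latitude, self.radius)
        if longitude is None:  # assumed end-of-grid signal
            return
        yield longitude, latitude

def start_requests(self):
    url = 'https://www.xxxx.com/map_searchByLocation'
    for longitude, latitude in self.generate_coordinates():
        data = get_form(longitude, latitude, self.radius)
        proxy = 'http://' + get_proxy()
        # Many independent requests are queued, so Scrapy can keep
        # CONCURRENT_REQUESTS of them in flight instead of only one.
        yield FormRequest(url, method='POST', formdata=data, callback=self.parse, dont_filter=True,
                          meta={'proxy': proxy, 'download_timeout': 3,
                                'longitude': data['longitude'], 'latitude': data['latitude'], 'data': data})

With this structure, parse() only needs to yield items; the chaining block at its end can be dropped.
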
  • Try profiling your code instead of guessing the bottlenecks. [Measure, don't guess.](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – dyz May 05 '19 at 11:02
  • Thank you! I have tried it, but I can't figure it out. I have pasted some partial code; could you help me analyse it? – 0necXmTz May 05 '19 at 13:19
  • So you mentioned that the next url isn't created until the previous response finishes. This is killing your throughput. I also don't believe you need Redis. What about reading sequentially through your urls file, N at a time, and having N concurrent requests? – dyz May 05 '19 at 13:28
  • Also, performance will degrade when N is too large. 40M urls will take days. This is more of a distributed computing problem. – dyz May 05 '19 at 13:36
  • @dyz Thank you! So I should get a list of all the URLs and keep N of them queued at all times, so that N requests can run at a time? I will try it; hope it helps. Thank you very much! – 0necXmTz May 05 '19 at 15:39
  • If you yield *all* other URLs from the first response handler, that is not the issue. Does increasing both CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_IP not increase concurrency? If so, what does your CPU usage look like, does it reach 100% (of a core)? – Gallaecio May 06 '19 at 08:36
  • @Gallaecio Hi, I checked my CPU usage just now; the highest usage of any core was around 50%, so the bottleneck is not the CPU, right? In fact, I yield requests one by one: the next request is only created when the previous response has finished. I think that may be the bottleneck. – 0necXmTz May 06 '19 at 09:38
  • @Gallaecio Now I can get all the URLs instead of yielding them one by one. So how can I control how many pages the spider fetches at once? If I want to crawl N pages at a time, what should I do? All I know is to put all the URLs into start_urls and adjust CONCURRENT_REQUESTS, but the problem is that I have 40M URLs, and putting all of them into start_urls could make my computer crash. Is there a way to feed URLs into start_urls dynamically? I know about distributed systems, but I want to improve performance on one computer first. Could you help me? Thank you very much! – 0necXmTz May 06 '19 at 09:38
  • If you have them in a text file, you can implement `start_requests()` to iterate through the file lines and yield the corresponding requests. For other options, search questions tagged with `scrapy`, as that’s a common topic. – Gallaecio May 06 '19 at 09:43
  • @Gallaecio Thank you! I will try it. – 0necXmTz May 06 '19 at 09:46
  • You're using a proxy. Chances are that it is your bottleneck. Which one is it? Is it a service you signed up or an app you've installed? How does it handle concurrency? – Thiago Curvelo May 06 '19 at 11:49
  • @ThiagoCurvelo Hi, I get proxies from an API that I bought from a proxy provider, and I have tested it: the proxy is not the bottleneck. (I tried running without the proxy and the spider's speed did not improve significantly.) – 0necXmTz May 06 '19 at 12:01
  • @ThiagoCurvelo I have now found a way to increase the spider's speed: I push all of the URLs into a Redis database, and in start_requests I run an endless loop that keeps popping a URL from Redis and yielding a request for it. But how can I control the number of URLs in the not-yet-crawled queue? For example, I want 100 URLs pending at all times: crawl one page, add one URL to the queue, so that my machine does not get overwhelmed. (As described, that loop would put all 40M URLs into the pending queue at once, which I think might crash my computer. Just guessing, I have not tested it.) See the sketch after these comments. – 0necXmTz May 06 '19 at 12:26
  • Did you check [scrapy-redis](https://github.com/rmax/scrapy-redis)? Maybe it had what you want solved already. – Thiago Curvelo May 06 '19 at 12:44
  • @ThiagoCurvelo Thank you! From its name, it sounds like it might help. I will check it out. – 0necXmTz May 06 '19 at 12:48
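
Following the suggestions in the comments (read the URLs from a text file in start_requests, or pop them from Redis), here is a minimal sketch of the file-based version; the file name urls.txt and the one-URL-per-line format are assumptions. Scrapy consumes the start_requests generator lazily, so the 40M lines are never loaded into memory at once, and CONCURRENT_REQUESTS caps how many requests are downloading at any moment; the same applies to a loop popping URLs from Redis, so no manual counting of the pending queue should be needed.

import scrapy


class FileUrlSpider(scrapy.Spider):
    # Hypothetical spider; assumes urls.txt holds one URL per line.
    name = 'file_url_spider'

    def start_requests(self):
        with open('urls.txt') as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                # The file is read lazily, line by line, as the scheduler
                # asks for more requests, so memory use stays flat.
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Extraction logic goes here.
        pass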

0 Answers