I have ~80,000 URLs and I'd like to get the response status codes for them. Note that I'd like to get them as fast as possible. I've tried HEAD and GET requests using the requests Python library, but it's too slow for my goal: according to my calculations it would take more than 10 hours. It's sad.
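For reference, this is roughly the sequential version I tried first (a minimal sketch, assuming the same 'urls' file as in the tornado code below; one HEAD request at a time):

import requests

# Check every URL sequentially, one HEAD request at a time.
# Correct, but far too slow: ~80,000 URLs would take > 10 hours.
for url in open('urls'):
    url = url.strip()
    try:
        r = requests.head(url, timeout=30)
        print url, r.status_code
    except requests.RequestException as e:
        print url, 'failed:', e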
Another approach I've found is using tornado. I've tested it (please take a look at the code below) on 500 URLs. It did its work fast, but (!) a huge number of the response codes are 599. That's strange, so I checked URLs which map to the 599 code through a browser (a simple GET request) and made sure those URLs are perfectly fine. How can I solve this problem?
from tornado import ioloop, httpclient
import tornado

i = 0      # number of requests still in flight
good = 0   # number of 200 responses seen so far

def handle_request(response):
    global good
    if response.code != 200:
        print response.code, response.reason, response.request.url
    else:
        good += 1
        print 'KKKKKKKKKKK: ', good, '[%s]' % response.request.url
    global i
    i -= 1
    # Stop the IOLoop once every request has been answered.
    if i <= 0:
        ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()

lis = []
for url in open('urls'):
    lis.append(url.strip())

specific_domain = '...'

for l in lis[:500]:
    i += 1
    # Use GET for one specific domain, HEAD for everything else.
    method = 'GET' if specific_domain in l else 'HEAD'
    req = tornado.httpclient.HTTPRequest(l, method=method, request_timeout=30.0)
    http_client.fetch(req, handle_request)

ioloop.IOLoop.instance().start()
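And this is the kind of manual re-check I do afterwards (a minimal sketch; some_url is a placeholder for one of the URLs that tornado reported as 599):

import requests

# Re-check one URL that tornado reported as 599 with a plain GET,
# which is essentially what the browser does.
some_url = 'http://example.com/'  # placeholder for one of the 599 URLs
r = requests.get(some_url, timeout=30)
print r.status_code  # the URLs I checked come back fine this way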