
I have ~80,000 URLs and I'd like to get the response status code for each of them, as fast as possible. I've tried HEAD and GET requests with the Python `requests` library, but it's too slow for my goal: by my estimate it would take more than 10 hours. It's sad. Another approach I've found is Tornado. I've tested it (please take a look at the code) on 500 URLs. It finished fast, but (!) a huge share of the response codes were 599. That's strange: I then checked the URLs that map to a 599 code in a browser (a simple GET request) and confirmed that each URL is perfectly fine. How can I solve this problem?

from tornado import ioloop, httpclient

i = 0     # number of outstanding requests
good = 0  # count of 200 responses


def handle_request(response):
    global good, i
    if response.code != 200:
        print response.code, response.reason, response.request.url
    else:
        good += 1
        print 'OK: ', good, '[%s]' % response.request.url
    i -= 1
    if i <= 0:
        ioloop.IOLoop.instance().stop()


http_client = httpclient.AsyncHTTPClient()
urls = [line.strip() for line in open('urls')]
specific_domain = '...'
for url in urls[:500]:
    i += 1
    # HEAD is cheaper, but this one domain needs a full GET
    method = 'GET' if specific_domain in url else 'HEAD'
    req = httpclient.HTTPRequest(url, method=method, request_timeout=30.0)
    http_client.fetch(req, handle_request)

ioloop.IOLoop.instance().start()
  • Have you considered using `multiprocessing` from the standard library? Here's an example: https://gist.github.com/hoffrocket/3493802 – Håken Lid Jan 30 '16 at 23:04
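The comment's thread-pool idea can be sketched with `multiprocessing.dummy` (a thread-based `Pool` with the `multiprocessing` API; threads are enough here since the work is I/O-bound). The names `head_status`, `check_urls`, and the worker count below are illustrative choices, not taken from the linked gist:

```python
# Sketch: map a status check over many URLs with a thread pool.
from multiprocessing.dummy import Pool  # thread-based Pool

try:
    from http.client import HTTPConnection   # Python 3
except ImportError:
    from httplib import HTTPConnection       # Python 2


def head_status(url, timeout=30):
    """Return the status code of a HEAD request, or None on any error.
    Plain-HTTP sketch only; real code would parse the URL properly and
    use HTTPSConnection for https."""
    try:
        parts = url.split('/', 3)             # ['http:', '', host, path]
        host = parts[2]
        path = '/' + parts[3] if len(parts) > 3 else '/'
        conn = HTTPConnection(host, timeout=timeout)
        conn.request('HEAD', path)
        return conn.getresponse().status
    except Exception:
        return None


def check_urls(urls, check_one=head_status, workers=50):
    """Run check_one over urls concurrently; return (url, status) pairs."""
    pool = Pool(workers)
    try:
        return list(zip(urls, pool.map(check_one, urls)))
    finally:
        pool.close()
        pool.join()
```

With 50 workers, 80,000 HEAD requests at ~1 s each come down to well under an hour, network permitting.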

1 Answer


599 is the response code Tornado generates for an internal timeout. In this case most of the requests are probably timing out in the queue while waiting for a slot. You can either increase the timeouts (pass `request_timeout` when making the request) or manage your own queue so that requests are fed into `AsyncHTTPClient` only as fast as they can be handled. The latter is normally recommended for large crawling jobs, since it lets you make your own decisions about prioritization and fairness across different hosts. For an example with a queue, see my answer to "tornado: AsyncHttpClient.fetch from an iterator?"
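The bounded-queue pattern the answer describes looks roughly like this, sketched here with stdlib threads so it is self-contained; with Tornado you would use `tornado.queues.Queue` feeding worker coroutines instead, as in the linked answer. `fetch`, `crawl`, and the concurrency value are placeholder names for illustration:

```python
# Sketch of the bounded-fetch pattern: a fixed number of workers pull
# URLs from a queue, so no request waits long enough to hit a
# client-side timeout.
import threading

try:
    import queue             # Python 3
except ImportError:
    import Queue as queue    # Python 2


def crawl(urls, fetch, concurrency=50):
    """Fetch every URL with at most `concurrency` requests in flight.
    Returns a dict mapping url -> fetch(url)."""
    q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = q.get()
            if url is None:          # sentinel: shut this worker down
                return
            r = fetch(url)
            with lock:
                results[url] = r

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(None)                  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Because a URL is only dequeued when a worker is free, requests start their `request_timeout` clock when they actually begin, not while sitting behind 79,000 others.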

Ben Darnell
  • Ben, thanks. I've tried increasing `request_timeout` to 60; the problem is the same. I've seen your solution, but there `http_client.fetch` returns a `Future` object as the `response`. Is it possible to get the status code from it? – user3780183 Jan 30 '16 at 22:28
  • Oops, that example was missing a `yield`. It should have been `response = yield http_client.fetch(...)`, and then you can use `response.code`. I've updated it now. – Ben Darnell Jan 30 '16 at 22:38