
Just a practical question: I need to retrieve the HTTP status code of a site as well as its IP address.

Given that I normally need to process between 10k and 150k domains, I was wondering which is the most efficient method.

I've seen that urllib2.urlopen(site) attempts to download the entire file stream connected to the URL. At the same time, urllib2 doesn't offer a method to convert a hostname into an IP.

Given that I'm interested only in the HEAD, to collect information like the HTTP status code and the IP address of that specific server, what is the best way to operate?

Should I try to use only the socket? Thanks

Andrea Moro

1 Answer


I don't think there is a single magic tool that will retrieve both the HTTP status code of a site and its IP address.

For getting the HTTP status code, you should make a HEAD request using urllib2, httplib, or requests. Here's an example, taken from How do you send a HEAD HTTP request in Python 2?:

>>> import urllib2
>>> class HeadRequest(urllib2.Request):
...     def get_method(self):
...         return "HEAD"
... 
>>> response = urllib2.urlopen(HeadRequest("http://google.com/index.html"))

The same thing with requests is a one-liner. Note that requests.head() does not follow redirects by default, which is why google.com reports its redirect status here:

>>> import requests
>>> requests.head('http://google.com').status_code
301

Also, you might want to take a look at grequests in order to speed things up with getting status codes from multiple pages.

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

For getting an IP address, you should use socket:

>>> import socket
>>> socket.gethostbyname_ex('google.com')
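As for the last part of your question: yes, you can do both jobs with nothing but socket. Here is a minimal sketch (head_status_and_ip is a hypothetical helper, not a library function; plain HTTP only, no redirect handling, no HTTPS):

```python
import socket

def head_status_and_ip(host, port=80, timeout=5):
    """Resolve a hostname and fetch its HTTP status code over a raw socket.

    Returns (ip, status_code). Minimal sketch: no redirects, no HTTPS.
    """
    ip = socket.gethostbyname(host)
    sock = socket.create_connection((ip, port), timeout=timeout)
    try:
        # A bare HEAD request; Connection: close so the server hangs up after.
        request = "HEAD / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % host
        sock.sendall(request.encode("ascii"))
        # The status line looks like "HTTP/1.1 200 OK"; parse out the code.
        status_line = sock.recv(1024).split(b"\r\n", 1)[0]
        status_code = int(status_line.split()[1])
    finally:
        sock.close()
    return ip, status_code
```

For 10k–150k domains you would still want to parallelize this (with a thread pool, gevent, etc.), since each call blocks on DNS resolution and the TCP handshake.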


Hope that helps.

alecxe