I am looking to do a large number of reverse DNS lookups in a small amount of time. I currently have an asynchronous lookup implemented with socket.gethostbyaddr and a concurrent.futures thread pool, but I am still not seeing the desired performance. For example, the script took about 22 minutes to complete on 2500 IP addresses.

I was wondering if there is any quicker way to do this without resorting to something like adns-python. I found this post, which provided some additional background: http://blog.schmichael.com/2007/09/18/a-lesson-on-python-dns-and-threads/

Code Snippet:

import concurrent.futures
import socket

def get_hostname_from_ip(ip):
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # covers socket.herror / socket.gaierror for unresolvable addresses
        return ""

ips = [...]  # list of IP address strings
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(get_hostname_from_ip, ips))

I think part of the issue is that many of the IP addresses are not resolving and timing out. I tried:

socket.setdefaulttimeout(2.0)

but it seems to have no effect.

Wert

3 Answers


I discovered my main issue was IPs failing to resolve: the sockets were not obeying their set timeouts and were only failing after 30 seconds. See Python 2.6 urllib2 timeout issue.

adns-python was a no-go because of its lack of support for IPv6 (without patches).

After searching around I found this: Reverse DNS Lookups with dnspython, and implemented a similar version in my code (that code also uses an optional thread pool and implements a timeout).

In the end I used dnspython with a concurrent.futures thread pool for asynchronous reverse DNS lookups (see Python: Reverse DNS Lookup in a shared hosting and Dnspython: Setting query timeout/lifetime). With a timeout of 1 second this cut the runtime from about 22 minutes to about 16 seconds on 2500 IP addresses. The large difference can probably be attributed to the Global Interpreter Lock preventing the socket lookups from running concurrently (see the blog post linked in the question) and to the 30-second timeouts on unresolvable addresses.

Code Snippet:

import concurrent.futures
from dns import resolver, reversename

dns_resolver = resolver.Resolver()
dns_resolver.timeout = 1   # seconds to wait on a single nameserver
dns_resolver.lifetime = 1  # total seconds allowed for the whole query

def get_hostname_from_ip(ip):
    try:
        # e.g. 192.0.2.1 -> 1.2.0.192.in-addr.arpa.
        reverse_name = reversename.from_address(ip)
        # [:-1] strips the trailing dot from the PTR target
        return dns_resolver.query(reverse_name, "PTR")[0].to_text()[:-1]
    except Exception:
        # unresolvable or timed-out addresses become empty strings
        return ""

ips = [...]  # list of IP address strings
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(get_hostname_from_ip, ips))
Wert

Because of the Global Interpreter Lock, you should use ProcessPoolExecutor instead. https://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor
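
A rough sketch of that suggestion (not from the original answer), reusing the get_hostname_from_ip worker from the question. With ProcessPoolExecutor the worker has to be defined at module level so it can be pickled and sent to the child processes:

import concurrent.futures
import socket

def get_hostname_from_ip(ip):
    # same worker as in the question, defined at module level for pickling
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""

if __name__ == "__main__":
    ips = [...]  # list of IP address strings
    with concurrent.futures.ProcessPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(get_hostname_from_ip, ips))

Process startup and pickling add some overhead, so this mainly pays off if the lookups really are being serialized within a single process.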

Addison
    I have a very limited understanding of the Python GIL but since the majority of time is spent blocking on IO shouldn't this not matter? Edit: I use similar code for asynchronous http requests and it seems to do the trick. – Wert Jun 07 '14 at 07:53
  • Did you figure out what the issue was? – Addison Jun 07 '14 at 08:19
  • Yes it was IPs failing to resolve. The sockets were ignoring their timeout and some were taking up to 30s before failing. Browsing around I found this: http://stackoverflow.com/questions/14127115/python-2-6-urlib2-timeout-issue which explains the issue. I will post my solution. – Wert Jun 08 '14 at 00:07

Please use asynchronous DNS; everything else will give you very poor performance.
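
The answer gives no code; as a rough sketch, assuming dnspython 2.0+ (which ships an asyncio-based dns.asyncresolver), the lookups can all be issued concurrently from a single thread along these lines (get_hostname_from_ip and main are illustrative names, not part of the original answer):

import asyncio
from dns import asyncresolver, reversename

dns_resolver = asyncresolver.Resolver()
dns_resolver.timeout = 1
dns_resolver.lifetime = 1

async def get_hostname_from_ip(ip):
    try:
        reverse_name = reversename.from_address(ip)
        answer = await dns_resolver.resolve(reverse_name, "PTR")
        return answer[0].to_text().rstrip(".")
    except Exception:
        # treat timeouts and unresolvable addresses as empty results
        return ""

async def main(ips):
    # fire all queries at once; the event loop multiplexes them on one thread
    return await asyncio.gather(*(get_hostname_from_ip(ip) for ip in ips))

ips = [...]  # list of IP address strings
results = asyncio.run(main(ips))

For a very large list of IPs you would probably cap the concurrency (for example with an asyncio.Semaphore) so the local resolver is not flooded.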

lenik
  • I found that my answer using dnspython gave quite reasonable performance. I wonder how *adns-python* would compare. – Wert Jun 08 '14 at 00:33