I've got a script that gets DNS (CNAME, MX, NS) data in the following way:

from dns import resolver
...

def resolve_dns(url):
    response_dict = {}
    print "\nResolving DNS for %s" % (url)

    try: 
        response_dict['CNAME'] = [rdata for rdata in resolver.query(url, 'CNAME')]
    except:
        pass

    try: 
        response_dict['MX'] = [rdata for rdata in resolver.query(url, 'MX')]
    except:
        pass

    try: 
        response_dict['NS'] = [rdata for rdata in resolver.query(url, 'NS')]
    except:
        pass

    return response_dict

This function is being called sequentially for successive URLs. If possible, I'd like to speed up the above process by getting the data for multiple URLs simultaneously.
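
For illustration, the current calling pattern is just a serial loop, something like this (a minimal sketch; the urls list here is hypothetical):

urls = ['example.com', 'stackoverflow.com']
# one blocking resolve_dns() call per URL, one after another
results = [resolve_dns(url) for url in urls]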

Is there a way to accomplish what the above script does for a batch of URLs (perhaps returning a list of dict objects, with each dict corresponding to the data for a particular URL)?

Boa
  • If the resolver API does not allow batch URI lookups, I would suggest using threading or multiprocessing. http://stackoverflow.com/a/7207336/655757 – Hari Dec 19 '15 at 23:09

1 Answer


You can put the work into a thread pool. Your resolve_dns makes three queries serially, so I created a slightly more generic worker that performs a single query and returns the host along with its results, and used itertools.product to generate every (host, query type) combination. In the thread pool I set chunksize=1 to avoid batching, which can otherwise increase execution time when some queries take a long time.

import collections
import itertools
import multiprocessing.pool

import dns.exception
from dns import resolver

def worker(arg):
    """Query DNS for one (hostname, qname) pair and return
    (hostname, qname, [rdata, ...])."""
    url, qname = arg
    try:
        rdatalist = [rdata for rdata in resolver.query(url, qname)]
    except dns.exception.DNSException:
        rdatalist = []
    return url, qname, rdatalist

def resolve_dns(url_list):
    """Given a list of hosts, return a dict mapping each host to a
    dict of its qname -> rdata records.
    """
    response_dict = collections.defaultdict(dict)
    # create a pool for the queries but cap the max number of threads
    pool = multiprocessing.pool.ThreadPool(processes=min(len(url_list) * 3, 60))
    # run one worker per (host, qname) combination; chunksize=1 keeps a
    # slow query from delaying the others batched in its chunk
    for url, qname, rdatalist in pool.imap(
            worker,
            itertools.product(url_list, ('CNAME', 'MX', 'NS')),
            chunksize=1):
        response_dict[url][qname] = rdatalist
    pool.close()
    pool.join()
    return response_dict

url_list = ['example.com', 'stackoverflow.com']
result = resolve_dns(url_list)
for url, qname_dict in result.items():
    print(url)
    for qname, rdatalist in qname_dict.items():
        print('  ', qname)
        for rdata in rdatalist:
            print('    ', rdata)
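
If you prefer the standard-library futures API, the same fan-out can be sketched with concurrent.futures.ThreadPoolExecutor. This is a rough equivalent, not part of the original answer; it reuses the worker function defined above:

import collections
import itertools
from concurrent.futures import ThreadPoolExecutor

def resolve_dns_futures(url_list):
    """Same per-(host, qname) fan-out, using an executor instead of a pool."""
    combos = list(itertools.product(url_list, ('CNAME', 'MX', 'NS')))
    results = collections.defaultdict(dict)
    # the executor joins its worker threads on leaving the with block
    with ThreadPoolExecutor(max_workers=min(len(combos), 60)) as executor:
        for url, qname, rdatalist in executor.map(worker, combos):
            results[url][qname] = rdatalist
    return results
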
tdelaney