
I am planning to run reverse DNS lookups on 47 million IPs. Here is my code:

import socket

ips = []
with open(file, 'r') as f:
    with open('./ip_ptr_new.txt', 'a') as w:
        for l in f:
            la = l.rstrip('\n')
            ip, countdomain = la.split('|')
            ips.append(ip)
            try:
                ais = socket.gethostbyaddr(ip)
                print("%s|%s|%s" % (ip, ais[0], countdomain), file=w)
            except:
                print("%s|%s|%s" % (ip, "None", countdomain), file=w)

Currently it is very slow. Does anybody have any suggestions for speeding it up?

UserYmY
  • Run multiple threads/processes – matino Jul 09 '15 at 12:02
  • @matino my machine has only two CPUs, and with the Python multiprocessing module two processes would not speed it up that much. Can you elaborate more? – UserYmY Jul 09 '15 at 12:06
  • Just because you have 2 CPUs doesn't mean you can only run 2 processes at once. Try running with more - it's unlikely that this code is CPU-bound. – Tom Dalton Jul 09 '15 at 12:09
  • Well, you could use 2 processes, each spawning multiple threads; each thread 1) reads only part of the file and 2) looks up the hosts for that part (see the sketch after these comments). – matino Jul 09 '15 at 12:12
  • @matino Thanks. It would be really helpful to me if you can write the answer separately . – UserYmY Jul 09 '15 at 12:16
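A minimal sketch of the thread-based approach matino describes in the comments above (the file names and the pipe-separated ip|countdomain format are taken from the question; the worker count and the use of concurrent.futures are assumptions, not necessarily what matino had in mind):

# sketch: many threads per process, since gethostbyaddr is I/O-bound, not CPU-bound
import socket
from concurrent.futures import ThreadPoolExecutor

def lookup(line):
    ip, countdomain = line.rstrip('\n').split('|')
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        host = "None"
    return "%s|%s|%s" % (ip, host, countdomain)

with open('input.txt') as f, open('./ip_ptr_new.txt', 'a') as w:
    # 50 workers is an arbitrary starting point; tune it experimentally
    # note: Executor.map submits all work up front, so for a 47M-line file feed it in batches
    with ThreadPoolExecutor(max_workers=50) as pool:
        for result in pool.map(lookup, f):
            print(result, file=w)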

3 Answers


Try using the multiprocessing module. I timed the performance for about 8000 IPs and got this:

#dns.py
real    0m2.864s
user    0m0.788s
sys     0m1.216s


#slowdns.py
real    0m17.841s
user    0m0.712s
sys     0m0.772s


# dns.py
from multiprocessing import Pool
import socket

def dns_lookup(ip):
    # each worker receives an (ip, countdomain) tuple and prints its result to stdout
    ip, countdomain = ip
    try:
        ais = socket.gethostbyaddr(ip)
        print("%s|%s|%s" % (ip, ais[0], countdomain))
    except:
        print("%s|%s|%s" % (ip, "None", countdomain))

if __name__ == '__main__':
    filename = "input.txt"
    ips = []
    # read all (ip, countdomain) pairs into a list, then map them over a pool of 5 workers
    with open(filename, 'r') as f:
        for l in f:
            la = l.rstrip('\n')
            ip, countdomain = la.split('|')
            ips.append((ip, countdomain))
    # results go to stdout here; redirect to a file, e.g. python dns.py > ip_ptr_new.txt
    p = Pool(5)
    p.map(dns_lookup, ips)





# slowdns.py
import socket
from multiprocessing import Pool

filename = "input.txt"
if __name__ == '__main__':
    # sequential baseline: one blocking gethostbyaddr call per line
    ips = []
    with open(filename, 'r') as f:
        with open('./ip_ptr_new.txt', 'a') as w:
            for l in f:
                la = l.rstrip('\n')
                ip, countdomain = la.split('|')
                ips.append(ip)
                try:
                    ais = socket.gethostbyaddr(ip)
                    print("%s|%s|%s" % (ip, ais[0], countdomain), file=w)
                except:
                    print("%s|%s|%s" % (ip, "None", countdomain), file=w)
vHalaharvi
  • Thanks for your help. The problem is that the number of IPs is huge, and putting them into a list takes almost 90% of the memory. Is there a way to read line by line and pass it to the multiprocessing module? – UserYmY Jul 09 '15 at 13:53
  • Sorry, I missed that requirement. In that case you cannot use the list; instead you will have to use a queue. Reading from large files would still be OK since Python uses generators and not lists. Please check this thread out: http://stackoverflow.com/questions/14677287/multiprocessing-with-large-data – vHalaharvi Jul 09 '15 at 14:17
  • hmm Thanks. But I still do not know how to make use of the queue in this case – UserYmY Jul 10 '15 at 07:25
  • OK, I will see if I'll be able to post an edit some time today with the actual code. – vHalaharvi Jul 10 '15 at 12:54
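For what it's worth, a sketch of how the full list could be avoided without an explicit queue, by feeding the pool from a generator via Pool.imap_unordered (the file names and chunk size are illustrative; the worker mirrors dns_lookup from the answer above):

# sketch: stream lines through the pool instead of building a 47M-entry list
import socket
from multiprocessing import Pool

def dns_lookup(pair):
    ip, countdomain = pair
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        host = "None"
    return "%s|%s|%s" % (ip, host, countdomain)

def read_pairs(filename):
    # generator: yields one (ip, countdomain) tuple at a time, never the whole file
    with open(filename) as f:
        for line in f:
            ip, countdomain = line.rstrip('\n').split('|')
            yield ip, countdomain

if __name__ == '__main__':
    with Pool(20) as p, open('./ip_ptr_new.txt', 'a') as w:
        # imap_unordered pulls tasks lazily; chunksize batches the inter-process traffic
        for row in p.imap_unordered(dns_lookup, read_pairs('input.txt'), chunksize=1000):
            print(row, file=w)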

One solution here is to use the nslookup shell command with its timeout option, and possibly the host command. An example, not perfect but useful:

import subprocess

def sh_dns(ip, dns):
    # query the given DNS server with nslookup, killing it after 0.2s via the timeout command
    a = subprocess.Popen(['timeout', '0.2', 'nslookup', '-norec', ip, dns], stdout=subprocess.PIPE)
    sortie = a.stdout.read()
    # the PTR answer appears after the last '=' in nslookup's output
    tab = str(sortie).split('=')
    if len(tab) > 1:
        return tab[len(tab) - 1].strip(' \\n\'')
    else:
        return ""
YB45

We recently had to deal with this problem as well. Running multiple processes didn't provide a good enough solution: it could take several days to process a few million IPs on a powerful AWS machine. What worked well was Amazon EMR; it took around half an hour on a 10-machine cluster. You cannot scale very far with one machine (and usually one network interface), since this is a network-intensive task. Using MapReduce across multiple machines certainly did the job.
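The answer does not include code; purely as an illustration of the shape of such a job, a map-only reverse-DNS task could be written with the mrjob library and launched on EMR (an assumed example, not the author's actual implementation; the ip|countdomain input format follows the question):

# revdns_job.py - sketch of a map-only mrjob job; mrjob itself and the file name are assumptions
import socket
from mrjob.job import MRJob

class ReverseDNSJob(MRJob):
    def mapper(self, _, line):
        # each input line is "ip|countdomain"; emit the ip as key and "host|countdomain" as value
        ip, countdomain = line.rstrip('\n').split('|')
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            host = "None"
        yield ip, "%s|%s" % (host, countdomain)

if __name__ == '__main__':
    ReverseDNSJob.run()

It could be run locally with python revdns_job.py input.txt, or on EMR with python revdns_job.py -r emr s3://your-bucket/input.txt (bucket name hypothetical), so the lookups are spread over the cluster's machines and network interfaces.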

Elhanan Mishraky