8

I was experimenting with Python and threads, and realized that DNS requests are blocking even in a multithreaded script. Consider the following script:

from threading import Thread
import socket

class Connection(Thread):
    def __init__(self, name, url):
        Thread.__init__(self)
        self._url = url
        self._name = name

    def run(self):
        print "Connecting...", self._name
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setblocking(0)
            s.connect((self._url, 80))
        except socket.gaierror:
            pass #not interested in it
        print "finished", self._name


if __name__ == '__main__':
    conns = []
    # all invalid addresses to see how they fail / check times
    conns.append(Connection("conn1", "www.2eg11erdhrtj.com"))
    conns.append(Connection("conn2", "www.e2ger2dh2rtj.com"))
    conns.append(Connection("conn3", "www.eg2de3rh1rtj.com"))
    conns.append(Connection("conn4", "www.ege2rh4rd1tj.com"))
    conns.append(Connection("conn5", "www.ege52drhrtj1.com"))

    for conn in conns:
        conn.start()

I don't know exactly how long the timeout is, but when running this the following happens:

  1. All Threads start and I get my printouts
  2. Every xx seconds, one thread displays finished, instead of all at once
  3. The Threads finish sequentially, not all at once (timeout = same for all!)

So my only guess is that this has to do with the GIL? Obviously the threads do not perform their tasks concurrently; only one connection is attempted at a time.

Does anyone know a way around this?

(asyncore doesn't help, and I'd prefer not to use twisted for now.) Isn't it possible to get this simple little thing done with Python?

Greetings, Tom

edit:

I am on Mac OS X. I just had a friend run this on Linux, and he actually gets the results I wanted. His socket.connect() calls return immediately, even in a non-threaded environment. And even when he sets the sockets to blocking with a timeout of 10 seconds, all his threads finish at the same time.

Can anyone explain this?
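One way to compare the two platforms is to time the name lookups themselves, leaving connect() out of the picture. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above; the helper names `timed_lookup` and `lookup_in_threads` are made up for illustration, and hostnames are taken from the command line. Results will depend on the operating system and the Python version.

```python
import socket
import sys
import threading
import time


def timed_lookup(host, port=80):
    """Resolve `host` and return (host, succeeded, seconds_taken)."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
        ok = True
    except socket.gaierror:
        ok = False
    return host, ok, time.perf_counter() - start


def lookup_in_threads(hosts):
    """Run one lookup per thread; also record when each one finished."""
    results = []
    t0 = time.perf_counter()

    def worker(host):
        name, ok, took = timed_lookup(host)
        # finished_at is measured from the moment the batch was started
        results.append((name, ok, took, time.perf_counter() - t0))

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Hostnames come from the command line; nothing is resolved without them.
    for name, ok, took, finished_at in lookup_in_threads(sys.argv[1:]):
        print("%-30s ok=%-5s took=%.2fs finished_at=%.2fs"
              % (name, ok, took, finished_at))
```

If the `finished_at` values climb in steps of roughly one lookup each, the lookups are being serialized; if they are all about the same, the lookups ran concurrently.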

Tom
  • Have you tried just using socket.getaddrinfo(host, port) to see if that has the same limitation? I unfortunately can't replicate this, because it's a DNS problem. In most cases, you should get a "gaierror: (-2, 'Name or service not known')" rather quickly. – JimB Jul 31 '09 at 14:28
  • Yes, I have tried this, and it has the same limitation. – Tom Jul 31 '09 at 14:30
  • I'm pretty sure that OSX uses getaddrinfo from the BSD libraries, which probably has the restriction described in the answer below. – JimB Jul 31 '09 at 14:37
  • "isn't it possible to get this simple little thing done with python?" Well, twisted is python. It is a pure-python library. Just check its source code to see how it does it. – nosklo Jul 31 '09 at 14:43
  • I wonder if twisted will suffer from the same limitation when run on Mac OS X. – Tom Jul 31 '09 at 14:54
  • It won't. Twisted.names is written entirely in Python, and doesn't call the host's getaddrinfo function at all. From the API docs: "The functions exposed in this module can be used for asynchronous name resolution and dns queries". – JimB Jul 31 '09 at 15:05
  • Shit! It's true. I got the same result. I don't know why Python sucks in this area! – jack Oct 27 '10 at 14:31

3 Answers

15

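On some systems, getaddrinfo is not thread-safe. Python assumes that FreeBSD, OpenBSD, NetBSD, OSX, and VMS are among them. On those systems, Python maintains a lock specifically for the netdb functions (i.e. getaddrinfo and friends), so only one thread can resolve a name at a time. That lock, rather than the GIL, is why your lookups finish one after another.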

So if you can't switch operating systems, you'll have to use a different (thread-safe) resolver library, such as twisted's.
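The effect of such a lock can be illustrated without touching DNS at all. This is only a simulation in Python 3, not Python's actual resolver code: `fake_lookup` and `run_lookups` are hypothetical names, `time.sleep` stands in for the blocking resolver call, and a plain `threading.Lock` plays the role of the netdb lock.

```python
import threading
import time
from contextlib import nullcontext


def fake_lookup(guard, delay):
    """Stand-in for a blocking resolver call, optionally behind a shared lock."""
    with guard:
        # Sleeping releases the GIL, much like a blocking call into C code,
        # so any serialization seen here comes from `guard` alone.
        time.sleep(delay)


def run_lookups(guard, count=4, delay=0.05):
    """Run `count` fake lookups in threads; return total wall-clock seconds."""
    threads = [threading.Thread(target=fake_lookup, args=(guard, delay))
               for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # With a shared lock the total grows with the number of threads;
    # without one it stays close to a single delay.
    print("shared lock: %.2fs" % run_lookups(threading.Lock()))
    print("no lock:     %.2fs" % run_lookups(nullcontext()))
```

With the shared lock, the threads finish one by one even though each of them is only waiting, which matches the behaviour described in the question.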

Martin v. Löwis
  • I need this done pretty quickly, but I just ordered a book on twisted. People seem to really like it, so it can't be that bad after all. I will try to get into it a bit, but I am really busy lately. – Tom Jul 31 '09 at 14:55
  • There are other Python resolver libraries, such as dnspython and pydns, which might be simpler to use than twisted. They probably aren't fully thread-safe (creating a new UDP socket for each DNS request, and thus potentially exhausting port numbers), but that may not be a problem if you don't perform many queries. – Martin v. Löwis Jul 31 '09 at 15:48
  • I believe curl/libcurl use the c-ares resolver for this same reason. – Joe Aug 02 '09 at 01:20
2

If it's suitable, you could use the multiprocessing module to enable process-based parallelism:

import multiprocessing, socket

NUM_PROCESSES = 5

def get_url(url):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(0)
        s.connect((url, 80))
    except socket.gaierror:
        pass #not interested in it
    return 'finished ' + url


def main(url_list):
    pool = multiprocessing.Pool( NUM_PROCESSES )
    for output in pool.imap_unordered(get_url, url_list):
        print output

if __name__=="__main__":
    main("""
             www.2eg11erdhrtj.com 
             www.e2ger2dh2rtj.com
             www.eg2de3rh1rtj.com
             www.ege2rh4rd1tj.com
             www.ege52drhrtj1.com
          """.split())
m1k3y3
Joe Koberg
2

Send DNS requests asynchronously using Twisted Names:

import sys
from twisted.internet import reactor
from twisted.internet import defer
from twisted.names    import client
from twisted.python   import log

def process_names(names):
    log.startLogging(sys.stderr, setStdout=False)

    def print_results(results):
        for name, (success, result) in zip(names, results):
            if success:
                print "%s -> %s" % (name, result)
            else:
                print >>sys.stderr, "error: %s failed. Reason: %s" % (
                    name, result)

    d = defer.DeferredList(map(client.getHostByName, names), consumeErrors=True)
    d.addCallback(print_results)
    d.addErrback(defer.logError)
    d.addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(process_names, """
    google.com
    www.2eg11erdhrtj.com 
    www.e2ger2dh2rtj.com
    www.eg2de3rh1rtj.com
    www.ege2rh4rd1tj.com
    www.ege52drhrtj1.com
    """.split())
reactor.run()
jfs