
I have the following code:

import urllib2

def whatever(url, data=None):
    req = urllib2.Request(url)
    res = urllib2.urlopen(req, data)
    html = res.read()
    res.close()
    return html

I try to use it for GET requests like this:

for i in range(1,20):
    whatever(someurl)

It behaves correctly for the first 6 requests, then it blocks for about 5 seconds, and then continues to work normally for the rest of the GETs:

2012-06-29 15:20:22,487: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:22,507: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:22,528: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:22,552: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:22,569: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:22,592: Clear [127.0.0.1:49967]:   
**2012-06-29 15:20:26,486: Clear [127.0.0.1:49967]:**   
2012-06-29 15:20:26,515: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,555: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,586: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,608: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,638: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,655: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,680: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,700: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,717: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,753: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,770: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,789: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,809: Clear [127.0.0.1:49967]:   
2012-06-29 15:20:26,828: Clear [127.0.0.1:49967]:   

If I use POST (with data={'a':'b'}), each request gets stuck for 2 seconds. I've tried both urllib2 and pycurl, and they both give the same result.

Does anyone have any idea about this weird behaviour?


2 Answers


Another way to improve performance is to use threads:

import threading, urllib2
import Queue

# list of URLs to fetch; here simply the same URL 19 times, matching the sequential loop below
urls_to_load = ["http://www.stackoverflow.com"] * 19

def read_url(url, queue):
    # fetch one URL and push the response body onto the shared queue
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequential():
    result = Queue.Queue()
    for i in xrange(1, 20):
        read_url("http://www.stackoverflow.com", result)
    return result

Gives me [Finished in 0.2s].

P.S. If you don't need a list, use xrange instead of range.
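
If you need the downloaded pages afterwards, you can drain the queue, e.g. (a small usage sketch of my own, assuming urls_to_load and fetch_parallel as above):

results = fetch_parallel()
pages = []
while not results.empty():
    pages.append(results.get())
print('Got %d pages' % len(pages))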

  • this does help, but it was not the root cause of the problem. I found the final reason was that pycurl was not thread-safe; switching to urllib2 and using multiple threads solved my problem. – pinkdawn Jul 02 '12 at 02:47

The problem is in the DNS resolver.

Here is a good explanation

You can use a tool like this, and I suppose any other DNS resolver would solve the problem as well.
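
For example, you could check whether name resolution is really the slow part by timing a lookup separately (just a rough sketch of mine; www.example.com is only a placeholder for the host your requests go to):

import socket, time

host = "www.example.com"  # placeholder; use the host you are actually fetching
start = time.time()
ip = socket.gethostbyname(host)  # blocking DNS lookup
print('resolved %s to %s in %.3f s' % (host, ip, time.time() - start))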

  • Are you really sure? Because I'm using an IP address here.. [127.0.0.1:49967] – pinkdawn Jun 29 '12 at 08:03
  • Actually, on my local server your code ran fine ([Finished in 0.9s] for all 20 requests). And since you used both tools and both give the same result, I suppose the problem is on the server. – maxwell Jun 29 '12 at 08:06
  • I'm using this code in a Django context, and pycurl works badly under multi-threading (maybe the use of native code causes it). I switched to urllib2, using @async from [decorator @async](http://micheles.googlecode.com/hg/decorator/documentation.html#async) – pinkdawn Jul 02 '12 at 02:50