
I have a set of 50 URLs, and from each URL I am retrieving some data using urllib2. The procedure I am following (including setting a cookie for each URL) goes as follows:

import json
import urllib2

urls = ['https://someurl', 'https://someurl', ...]
vals = []

for url in urls:
    # `cookie` is assumed to be defined earlier in the script
    req2 = urllib2.Request(url)
    req2.add_header('cookie', cookie)
    response = urllib2.urlopen(req2)
    data = response.read()
    vals.append(json.loads(data))

So, basically, I am retrieving data from all these URLs and collecting it in the vals list. This entire procedure for 50 URLs takes around 15.5 to 20 seconds. I need to know if there is any other Python library through which I can do the same operation in a faster way, or if you can suggest any other faster approach using urllib2 itself, that would be fine as well. Thanks.

user2480542
  • You should try using `requests`. It makes a lot of these things easier to manage. (Note that it won't resolve *performance* problems in that way, just make for much better code.) – Chris Morgan Jul 05 '13 at 03:51
  • Chris, can you elaborate with any single example? – user2480542 Jul 06 '13 at 04:33
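In response to the comment above, a minimal sketch of the same loop using requests. This assumes the same `urls` list and `cookie` string from the question; as noted in the comment, it simplifies the code but does not by itself make the downloads faster:

import requests

# `urls` and `cookie` are assumed to be defined as in the question.
session = requests.Session()
session.headers.update({'cookie': cookie})

vals = []
for url in urls:
    response = session.get(url)
    vals.append(response.json())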

3 Answers


So if 15-20 seconds is too costly, there are a couple of things you can try:

  1. Use threading with urllib2 itself. An example is here (a sketch is also shown after this list).
  2. You can try pycurl (not sure about the performance improvement).
  3. Once I used subprocess.Popen to run the curl command and get the response from the URL in JSON format. I used it to call different URLs in parallel and grab the responses as they arrived, using the communicate method of the Popen object.
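
A minimal sketch of option 1, threading with urllib2, using one thread per URL. `urls` and `cookie` are assumed to be defined as in the question; note that the order of results is not preserved:

import json
import threading
import urllib2

vals = []
lock = threading.Lock()

def fetch(url):
    req = urllib2.Request(url)
    req.add_header('cookie', cookie)
    data = urllib2.urlopen(req).read()
    # list.append is thread-safe, but the lock makes the intent explicit
    with lock:
        vals.append(json.loads(data))

threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()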
Avichal Badaya

The speed of urllib2 won't be the limiting factor here; most of the time will be spent waiting for TCP connections to be established or for the remote server to respond.

Use of Python's multiprocessing module is fairly straightforward, but you could also use the threading module.

The multiprocessing.Pool could be used like this:

import json
import urllib2

from multiprocessing import Pool
# Use the following if you prefer to use threads over processes.
# from multiprocessing.pool import ThreadPool as Pool

urls = ['https://someurl', 'https://someurl', ...]

def download_json(url):
    # `cookie` is assumed to be defined as in the question
    req2 = urllib2.Request(url)
    req2.add_header('cookie', cookie)
    response = urllib2.urlopen(req2)
    data = response.read()
    return json.loads(data)

# Pool() defaults to one worker per CPU; map distributes the URLs across the workers
pool = Pool()
vals = pool.map(download_json, urls)
Austin Phillips

urllib2 is pretty fast (20 seconds for 50 URLs isn't that slow). It takes some time to connect to each resource.

What you want to do is multithreading.

Daniil Ryzhkov
  • hmm.. I read an example with the Queue class and the multiprocessing module.. just wondering how that can be implemented? – user2480542 Jul 05 '13 at 02:08
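
In case it helps with the comment above, a minimal sketch of the Queue-based worker pattern, here with threading.Thread workers pulling URLs from a Queue. `urls` and `cookie` are assumed from the question, and the number of workers is an arbitrary choice:

import json
import threading
import urllib2
from Queue import Queue

NUM_WORKERS = 10  # arbitrary; tune to taste

queue = Queue()
vals = []
lock = threading.Lock()

def worker():
    while True:
        url = queue.get()
        try:
            req = urllib2.Request(url)
            req.add_header('cookie', cookie)
            data = urllib2.urlopen(req).read()
            with lock:
                vals.append(json.loads(data))
        finally:
            queue.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True  # let the program exit even if a worker is blocked
    t.start()

for url in urls:
    queue.put(url)

queue.join()  # block until every URL has been processed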