
I'm having a heck of a time getting asynchronous / threaded HTTPS requests to work using Python's urllib2.

Does anyone out there have a basic example that implements urllib2.Request, urllib2.build_opener and a subclass of urllib2.HTTPSHandler?

Thanks!

SeaTurtle
    welcome to SO...do **you** have an example of what's not working for you currently? Might be easier to diagnose than to start from scratch in an answer here... – AJ. Apr 27 '11 at 17:32
    Is there a rule that every question has to be "debug my code?" My code is full of crazy references to variables I'd rather not explain, sensitive URLs, etc. This is 10 lines of code for someone who knows how to do it. – SeaTurtle Apr 27 '11 at 21:21
  • I see there is no accepted answer. Are you still interested in this? I've solved this issue a few days ago, so I could take the time to write a detailed answer with code.. – MestreLion Jul 29 '14 at 02:19

5 Answers


The code below makes 7 HTTP requests concurrently. It does not use threads; instead it uses asynchronous networking with the Twisted library.

from twisted.web import client
from twisted.internet import reactor, defer

urls = [
    'http://www.python.org',
    'http://stackoverflow.com',
    'http://www.twistedmatrix.com',
    'http://www.google.com',
    'http://launchpad.net',
    'http://github.com',
    'http://bitbucket.org',
]

def finish(results):
    # fires once every page has been downloaded
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
    reactor.stop()

# getPage returns a Deferred immediately, so all downloads run concurrently
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)

reactor.run()
nosklo
    Would rather not distribute my script with a Twisted requirement. Can you do this with built-ins urllib2.Request, urllib2.build_opener and a subclass of urllib2.HTTPSHandler? – SeaTurtle Apr 27 '11 at 21:23
  • @SeaTurtle: Twisted is open source and written in pure python. You could get the relevant parts from twisted and include in your code. In other words - consider ***twisted** itself* to be the example of how to do it with built-ins. – nosklo Apr 28 '11 at 21:07
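
For reference, a minimal stdlib-only sketch of what the comment asks for might look like the following, combining urllib2.Request, urllib2.build_opener and an urllib2.HTTPSHandler subclass with one thread per request. This is not from either answer; it assumes Python 2 (urllib2 was removed in Python 3), and the handler name and URLs are illustrative.

import threading
import urllib2

class VerboseHTTPSHandler(urllib2.HTTPSHandler):
    def https_response(self, request, response):
        # response processor: the opener calls this for every HTTPS response
        print 'fetched %s -> %s' % (response.geturl(), response.getcode())
        return response

opener = urllib2.build_opener(VerboseHTTPSHandler())

def fetch(url):
    request = urllib2.Request(url)
    opener.open(request).read()

threads = [threading.Thread(target=fetch, args=(url,))
           for url in ('https://www.python.org/', 'https://pypi.python.org/')]
for t in threads:
    t.start()
for t in threads:
    t.join()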

There's a really simple way, involving a custom handler for urllib2, which you can find here: http://pythonquirks.blogspot.co.uk/2009/12/asynchronous-http-request.html

#!/usr/bin/env python

import urllib2
import threading

class MyHandler(urllib2.HTTPHandler):
    def http_response(self, req, response):
        # runs in the worker thread once the response has arrived
        print "url: %s" % (response.geturl(),)
        print "info: %s" % (response.info(),)
        for l in response:
            print l
        return response

o = urllib2.build_opener(MyHandler())
t = threading.Thread(target=o.open, args=('http://www.google.com/',))
t.start()
print "I'm asynchronous!"    # printed while the request is still in flight

t.join()

print "I've ended!"
lkcl
    I would just like to warn that, while this method is easy and fast, it is very prone to problems when something breaks (e.g. the URL is not available). There is a nice beginner guide on threading at http://www.ibm.com/developerworks/aix/library/au-threadingpython/ which includes a very simple example of an async urllib2 solution. – stricjux May 18 '12 at 12:07
  • No, this is blocking I/O. You are not getting the benefits of multi-threading here; you are just dividing the whole task into time slices, so the total time will be the same. – Ali Berat Çetin Jul 07 '22 at 19:56
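
On stricjux's warning above: at minimum, the thread target can catch urllib2.URLError so a dead URL fails visibly instead of dumping a traceback inside the worker thread. A sketch under the same Python 2 assumptions; the URL is deliberately bogus.

import threading
import urllib2

def fetch(opener, url):
    try:
        opener.open(url, timeout=10).read()
    except urllib2.URLError as e:
        # HTTPError is a subclass of URLError, so this catches both
        print "failed %s: %s" % (url, e)

o = urllib2.build_opener()
t = threading.Thread(target=fetch, args=(o, 'http://no.such.host.invalid/'))
t.start()
t.join()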

Here is an example using urllib2 (with HTTPS) and threads. Each thread cycles through a list of URLs and retrieves each resource.

import itertools
import urllib2
from threading import Thread


THREADS = 2
URLS = (
    'https://foo/bar',
    'https://foo/baz',
    )


def main():
    for _ in range(THREADS):
        t = Agent(URLS)
        t.start()


class Agent(Thread):
    def __init__(self, urls):
        Thread.__init__(self)
        self.urls = urls

    def run(self):
        # cycle through the URL list forever, fetching one resource at a time
        urls = itertools.cycle(self.urls)
        while True:
            data = urllib2.urlopen(urls.next()).read()


if __name__ == '__main__':
    main()
Corey Goldberg
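
Note that these threads deliberately loop forever, load-generator style. To fetch each URL exactly once and then exit, a Queue-based variant of the same idea is common. This is a sketch, assuming Python 2; THREADS and URLS are the same placeholders as in the answer above.

import Queue
import urllib2
from threading import Thread

THREADS = 2
URLS = ('https://foo/bar', 'https://foo/baz')

def worker(queue):
    while True:
        url = queue.get()
        try:
            urllib2.urlopen(url).read()
        except urllib2.URLError:
            pass    # placeholder URLs will fail; real code would log this
        finally:
            queue.task_done()

queue = Queue.Queue()
for _ in range(THREADS):
    t = Thread(target=worker, args=(queue,))
    t.daemon = True    # let the program exit once the queue drains
    t.start()

for url in URLS:
    queue.put(url)
queue.join()    # blocks until every URL has been processed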

You can use asynchronous IO to do this.

requests + gevent = grequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)
grequests.map(rs)
bmpasini
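
As a follow-up, grequests.map returns the responses in input order (None, by default, for requests that failed), so the last line can capture and inspect them; url and status_code are standard requests attributes.

responses = grequests.map(rs)
for response in responses:
    if response is not None:
        print response.url, response.status_code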

Here is example code using eventlet:

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
     "https://wiki.secondlife.com/w/images/secondlife.jpg",
     "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

import eventlet
from eventlet.green import urllib2

def fetch(url):

  return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()

for body in pool.imap(fetch, urls):
  print "got body", len(body)
Xavier Combelle
  • Hi there, I'd rather not distribute my script with an eventlet requirement. Can you do this with built-ins urllib2.Request, urllib2.build_opener and a subclass of urllib2.HTTPSHandler? – SeaTurtle Apr 27 '11 at 21:26
  • No, that is not possible. Moreover, if I'm right, it only functions under Linux. – Xavier Combelle Apr 28 '11 at 11:43