6

I currently have a little script that downloads a webpage and extracts some data I'm interested in. Nothing fancy.

I'm downloading the page like so:

import commands
command = 'wget --output-document=- --quiet --http-user=USER --http-password=PASSWORD https://www.example.ca/page.aspx'
status, text = commands.getstatusoutput(command)

Although this works perfectly, I thought it'd make sense to remove the dependency on wget. I thought it should be trivial to convert the above to urllib2, but thus far I've had zero success. The Internet is full of urllib2 examples, but I haven't found anything that matches my need for simple username and password HTTP authentication with an HTTPS server.

Parker Coates

3 Answers

6

This says it should be straightforward

[as] long as your local Python has SSL support.

If you use just HTTP Basic Authentication, you must set a different handler, as described here.

Quoting the example there:

import urllib2

theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'
# a great password

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# this creates a password manager
passman.add_password(None, theurl, username, password)
# because we have put None at the start it will always
# use this username/password combination for URLs
# for which `theurl` is a super-url

authhandler = urllib2.HTTPBasicAuthHandler(passman)
# create the AuthHandler

opener = urllib2.build_opener(authhandler)

urllib2.install_opener(opener)
# All calls to urllib2.urlopen will now use our handler
# Make sure not to include the protocol in with the URL, or
# HTTPPasswordMgrWithDefaultRealm will be very confused.
# You must (of course) use it when fetching the page though.

pagehandle = urllib2.urlopen(theurl)
# authentication is now handled automatically for us

If you do Digest, you'll have to set some additional headers, but they are the same regardless of SSL usage. Google for python+urllib2+http+digest.
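
For what it's worth, urllib2 also ships an HTTPDigestAuthHandler, so one opener can be given both a Basic and a Digest handler that share a password manager. The following is only a sketch under assumptions: the URL and credentials are the placeholders from the question, the try/except import is there so the snippet also runs on Python 3 (where urllib2 became urllib.request), and the actual fetch is left commented out.

```python
try:
    import urllib2  # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent

# Placeholder URL and credentials taken from the question.
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, 'https://www.example.ca/', 'USER', 'PASSWORD')

# Register one handler per scheme; urllib2 picks the one that matches
# the challenge the server sends back with its 401 response.
opener = urllib2.build_opener(
    urllib2.HTTPBasicAuthHandler(passman),
    urllib2.HTTPDigestAuthHandler(passman),
)

# text = opener.open('https://www.example.ca/page.aspx').read()  # real fetch
```

Both handlers only react to a 401 challenge, so this does not help if the server never sends one.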

Cheers,

Boldewyn
  • Sorry, didn't get the authentication part. I'll update my answer in a second. – Boldewyn Jun 25 '09 at 20:00
  • Oho, oh. Looks like you'll have to do some extra work in urllib2: http://docs.python.org/howto/urllib2.html Basically, urllib2 does basic authentication also via headers. Sorry. – Boldewyn Jun 26 '09 at 08:06
  • I've tried with both HTTPBasicAuthHandler and HTTPDigestAuthHandler, but this is still giving me 401 errors. – Parker Coates Jun 26 '09 at 12:50
  • If the server allows non-authenticated access, the only way to authenticate with urllib2 is to construct the header manually: http://stackoverflow.com/questions/2407126/python-urllib2-basic-auth-problem – proski Sep 22 '15 at 02:32
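
Constructing the header manually, as the last comment suggests, could look roughly like the sketch below. The helper names basic_auth_header and build_request are made up for illustration, the URL and credentials are the placeholders from the question, and the try/except import is an assumption so the snippet also runs on Python 3. The network call itself is commented out.

```python
import base64

try:
    import urllib2  # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic 'Authorization' header (hypothetical helper)."""
    credentials = ('%s:%s' % (username, password)).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


def build_request(url, username, password):
    """Build a Request that sends credentials up front instead of waiting for a 401."""
    request = urllib2.Request(url)
    request.add_header('Authorization', basic_auth_header(username, password))
    return request


request = build_request('https://www.example.ca/page.aspx', 'USER', 'PASSWORD')
# text = urllib2.urlopen(request).read()  # uncomment to perform the real fetch
```

Because the header goes out with the first request, this works even when the server never issues a 401 challenge. The trade-off is that the credentials are sent unconditionally, so it should only be used over HTTPS.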
2

The requests module provides a modern API to HTTP/HTTPS capabilities.

import requests

url = 'https://www.someserver.com/toplevelurl/somepage.htm'

res = requests.get(url, auth=('USER', 'PASSWORD'))

status = res.status_code
text = res.text

Weston
1

The urllib2 documentation has an example of working with Basic Authentication:

http://docs.python.org/library/urllib2.html#examples

Corey Goldberg
  • How do I know which "realm" and "uri" to pass to add_password? I guess it's obvious that I don't know a whole lot about HTTP and authentication. – Parker Coates Jun 25 '09 at 22:13
  • Use urllib2.HTTPPasswordMgrWithDefaultRealm; it doesn't need to know the realm. The realm is, as far as I understand, just a way for the server to provide a (human-readable) name for the area to log into. Cheers, – Boldewyn Jun 26 '09 at 08:24
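
To see how the default-realm password manager answers the "realm" and "uri" question, it can be queried directly without touching the network. This is a small sketch, assuming the placeholder host and credentials from the question. The realm string and the second host are invented for illustration, and the try/except import is only there so it also runs on Python 3.

```python
try:
    import urllib2  # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means "use these credentials whatever realm the server names".
passman.add_password(None, 'https://www.example.ca/', 'USER', 'PASSWORD')

# Any realm name matches, as long as the URL sits below the registered one.
found = passman.find_user_password('Any Realm Name',
                                   'https://www.example.ca/page.aspx')

# A different host (made up here) is not covered by the registration.
other = passman.find_user_password('Any Realm Name',
                                   'https://elsewhere.example.org/')

print(found)
print(other)
```

So the "uri" only needs to be a prefix of the pages being fetched, such as the site root, and the realm can be left as None.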