How can you get content of protected pages using Python and urllib2?

I need to specify a username and password for the pages I am trying to retrieve, e.g.:

content = urllib2.urlopen(URL, username, password).read()

I know that is not part of the urllib2 API. I just need an example, using the actual API, of what I would have to write.

Martijn Pieters
Veni_Vidi_Vici
    HTML pages are usually protected in one of two ways: with a cookie token or with HTTP authentication (the Authorization header). You need to figure out which, then either get that cookie (usually by POSTing the username and password to a specific login form action) or add an [Authorization header](http://stackoverflow.com/questions/635113/python-urllib2-basic-http-authentication-and-tr-im). – Martijn Pieters May 22 '13 at 06:55

2 Answers


I suggest you look at the Python requests library.

It supports HTTP Basic authentication out of the box.

For example:

import requests
response = requests.get(URL, auth=('user', 'pass'))
content = response.text  # requests.get returns a Response object, not the body

With requests you can also set up a session (for cookie management), POST data such as a login form, and keep the resulting cookie to browse pages that are only accessible to logged-in users.

Read more about session objects and posting data in the excellent documentation.
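
As a rough sketch of the session approach, the helper below POSTs credentials to a login form and then requests a protected page through the same session. The function name `fetch_after_login` and the form field names `username` and `password` are assumptions made for illustration; a real site will have its own field names, and may also require hidden tokens.

```python
def fetch_after_login(session, login_url, page_url, username, password):
    """Log in through a form, then fetch a page with the same session."""
    # Hypothetical form field names; inspect the real login form to find them.
    form_data = {"username": username, "password": password}
    login_response = session.post(login_url, data=form_data)
    login_response.raise_for_status()

    # The session stores cookies set by the login response and
    # sends them again with this request.
    page_response = session.get(page_url)
    page_response.raise_for_status()
    return page_response.text

# Usage (requires the third-party requests package; URLs are placeholders):
#   import requests
#   with requests.Session() as session:
#       html = fetch_after_login(session, LOGIN_URL, PAGE_URL, 'user', 'pass')
```

Passing the session in as an argument keeps the helper independent of how the session was created, so one `requests.Session()` can be reused for every page fetched after logging in.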

If you absolutely have to use urllib2, here's a useful snippet taken from another thread for basic HTTP authentication:

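import base64
import urllib2

# username and password are assumed to be defined earlier as plain strings
request = urllib2.Request("http://api.foursquare.com/v1/user")
# Basic auth sends "username:password", base64-encoded, in the Authorization header
base64string = base64.standard_b64encode('%s:%s' % (username, password))
request.add_header("Authorization", "Basic %s" % base64string)
result = urllib2.urlopen(request)
content = result.read()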
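
To make the header value concrete, here is a small sketch that builds it on its own, written so that it also runs on Python 3, where base64 functions expect bytes. The helper name `basic_auth_header` is made up for this example.

```python
import base64

def basic_auth_header(username, password):
    """Return the value for an HTTP Basic Authorization header."""
    # Credentials are joined with a colon, then base64-encoded as bytes.
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic %s" % token

print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected if the request goes over HTTPS.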
Ewan

You can do this with urllib2; just look at the urllib2 docs.
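
One approach available in urllib2 is to build an opener with `HTTPBasicAuthHandler`, which answers the server's 401 challenge with the stored credentials. The sketch below assumes the page uses HTTP Basic authentication; the helper name `build_basic_auth_opener` is made up for this example, and the import fallback is there only because urllib2 became `urllib.request` in Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same classes

def build_basic_auth_opener(top_level_url, username, password):
    """Return an opener that supplies Basic credentials under top_level_url."""
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # A realm of None means "use these credentials for any realm".
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    return urllib2.build_opener(handler)

# Usage (URL and credentials are placeholders):
#   opener = build_basic_auth_opener('http://example.com/', 'user', 'pass')
#   content = opener.open('http://example.com/protected/page').read()
```

Unlike setting the Authorization header by hand, this only sends the credentials after the server asks for them, and only for URLs under the registered prefix.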

It's actually a lot easier to fill in a login form with a web driver like Selenium. The trade-off is that Selenium opens an actual browser window, while urllib2 runs in the background, but Selenium is much easier to use.

Selenium API

Those are just some suggestions that I hope help.

Serial