
I'm using urllib2 (on Python 2.7) to grab some content from a website. So far urllib2 has worked fine for fetching content, but this is the first time I've hit a site that requires a password at the content level. I have legitimate credentials (which I obviously can't share here), and it seems I'm somehow not supplying them correctly in my request.

I've used the method from Python urllib2, basic HTTP authentication, and tr.im, replacing (username, password) with my credentials as strings ("myUsername", "myPassword").

When I print result.read() I get a blank line, and when I try to print result.headers() I get:

<addinfourl at 40895752L whose fp = <socket._fileobject object at 0x00000000026757C8>>

for every expected instance of the call, which I assume means there's a file object of some sort there...

I tried print result.info() to see if any headers were coming back, and I do see a set of headers:

REDACTED
Date: Mon, 01 Oct 2012 10:06:24 GMT
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.1.6
Set-Cookie: OJSSID=mc7u47e674jmpjgk3kspfgc9l3; path=/
Refresh: 0; url=http:REDACTED loginMessage=reader.subscriptionRequiredLoginText
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

So I take from "loginMessage=reader.subscriptionRequiredLoginText" that I've not sent the credentials properly.

Any pointers?

The calling code is:

import base64
import urllib2

def getArticle(newLink):
    # Build the request and attach a Basic Auth header by hand
    request = urllib2.Request(newLink)
    base64string = base64.encodestring('%s:%s' % ("myUsername", "myPassword")).replace('\n', '')
    request.add_header("Authorization", "Basic %s" % base64string)
    result = urllib2.urlopen(request)
    print result.read()

and an example URL is: REDACTED - it's not my website!

– Jay Gattuso
2 Answers


You'll find the requests library much nicer to deal with than urllib2.

Looking at the link you provided, it doesn't require Basic Auth; rather, it's a form. So you need to take the URL from the form's 'action' attribute and submit your data to that. An example using requests:

import requests
url = 'http://www.content.alternative.ac.nz/index.php/alternative/login/signIn'
r = requests.post(url, data={'username': 'username', 'password': 'password', 'remember': '1'})

I can't check this fully (as I don't have a valid username and password), but by effectively ticking the "Remember Me" box when you send the form, you should get back a cookie, accessible via r.cookies, which can hopefully be used for further requests such as:

cookies = r.cookies
r = requests.get('http://www.content.alternative.ac.nz/index.php/alternative/article/view/176/202', cookies=cookies)
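
Alternatively, a requests.Session will store and re-send cookies for you, so you don't have to pass them around by hand. A minimal sketch, assuming the same form fields as above:

import requests

# A Session keeps the cookie jar across calls, so the login cookie
# captured from the POST is automatically sent with the GET that follows.
session = requests.Session()
session.post('http://www.content.alternative.ac.nz/index.php/alternative/login/signIn',
             data={'username': 'username', 'password': 'password', 'remember': '1'})
r = session.get('http://www.content.alternative.ac.nz/index.php/alternative/article/view/176/202')
print r.text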
– Jon Clements
  • Thanks, that's really useful. I'm going to (1) redact the URL info and (2) mark this as answered, pending my solution. I am confident that the solution lies in handling the form u:p properly :) Thank you for your time. – Jay Gattuso Oct 01 '12 at 10:13

I advise using Requests: HTTP for Humans instead of urllib2. It's much simpler to use and more obvious.

Sometimes sites don't support Basic HTTP authorization, which assumes sending the credentials in every request's header. Instead, they require a POST with the credentials to a login page. This POST is validated on the server and, if the credentials are correct, the server returns a response with a "Set-Cookie: name=value" header asking the browser to save a cookie. That cookie is then used to identify the authenticated client.

That seems to be your case. In your example, you need to make a POST request to http://www.content.alternative.ac.nz/index.php/alternative/login/signIn, setting the "login" and "password" parameters to the credentials you have, then retrieve the cookie from the response and add it to the next request, like this.
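
For illustration, here is a minimal sketch of that flow sticking with urllib2, using cookielib to capture and replay the cookie (the form field names are an assumption based on the form fields found in the other answer, and may differ):

import cookielib
import urllib
import urllib2

# An opener wired to a CookieJar captures Set-Cookie headers from the
# login response and replays them on later requests automatically.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# POST the credentials to the login action.
login_data = urllib.urlencode({'username': 'myUsername', 'password': 'myPassword'})
opener.open('http://www.content.alternative.ac.nz/index.php/alternative/login/signIn', login_data)

# The session cookie is now sent with subsequent requests.
result = opener.open('http://www.content.alternative.ac.nz/index.php/alternative/article/view/176/202')
print result.read()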

– Marboni
  • Oh, cool, that's lots to explore, thanks! I guess as long as you pass the cookie to each of the subsequent calls you can maintain a persistent session. Appreciate your time (and I fixed my error about getting the header data). – Jay Gattuso Oct 01 '12 at 10:11