2

I'm trying to log in to Wikipedia using a Python script, but despite following the instructions here, I just can't get it to work.

import urllib
import urllib2
import cookielib

username = 'myname'
password = 'mypassword'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6")]
login_data = urllib.urlencode({'wpName' : username, 'wpPassword' : password})
opener.open('http://en.wikipedia.org/w/index.php?title=Special:UserLogin', login_data)
resp = opener.open('http://en.wikipedia.org/wiki/Special:Watchlist')

All I get is the "You're not logged in" page. I tried logging in to another site with the script with the same negative result. I suspect it's either got something to do with cookies, or I'm missing something incredibly simple here. But I just cannot find it.

Conti
  • Try using Wireshark or a similar tool to inspect all the packets when logging in through the website; there you should see what the web app is actually sending to the server. – LavaScornedOven Sep 22 '12 at 19:57
  • You can use either `live http header firefox` or `chrome developer tools` to see all the requests that are sent once you click the login button. As far as I can see, you are missing a couple of things in `login_data`. – RanRag Sep 22 '12 at 20:29
  • Hmm, so that means I need to get a token first and send that along with my username and password? – Conti Sep 22 '12 at 20:49
  • @Conti that is correct, you will need to parse that token somehow. I am using BeautifulSoup in my example below. – K Z Sep 22 '12 at 22:58
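One way to see which fields the form expects, without a packet sniffer, is to list every named input in the login page's HTML. The following is only a minimal sketch using the standard library's `html.parser`; the `SAMPLE_FORM` markup, the token value `abc123`, and the helper names `InputCollector` and `extract_form_fields` are made up for illustration and are not Wikipedia's actual markup.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name/value pair of every named <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute are stored as an empty string.
            self.fields[name] = attr.get("value") or ""


def extract_form_fields(html):
    """Return a dict of input names to their default values."""
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical stand-in for the downloaded login page.
SAMPLE_FORM = """
<form method="post">
  <input name="wpName" type="text">
  <input name="wpPassword" type="password">
  <input name="wpLoginToken" type="hidden" value="abc123">
  <input type="submit" value="Log in">
</form>
"""

fields = extract_form_fields(SAMPLE_FORM)
print(sorted(fields))
```

Any field that shows up here but is absent from `login_data` is a candidate for what the script is missing.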

5 Answers

2

If you inspect the raw request sent to the login URL (with the help of a tool such as Charles Proxy), you will see that it actually sends 4 parameters: wpName, wpPassword, wpLoginAttempt and wpLoginToken. The first 3 are static and you can fill them in at any time. The 4th, however, must be parsed from the HTML of the login page. You then need to post this parsed value, together with the other 3, to the login URL in order to log in.

Here is the working code using Requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',
    }

with requests.session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)

    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')
K Z
  • Awesome, that did the trick! Thank you very much. :) I'm already using BeautifulSoup for my parsing, but I didn't know about Requests. It looks so much smoother than urllib. Gonna use that from now on. – Conti Sep 23 '12 at 09:05
  • @Conti Glad to help, and YES `Requests` is awesome :) – K Z Sep 23 '12 at 09:07
  • @KayZhu - switch the [] access methods to .get() to allow for non-existing attributes; specifically, some forms don't assign "name" attributes to all input elements, so `n.get('name','')` handles these without raising a KeyError – jmetz Nov 24 '14 at 13:01
2

Adding these two lines

r = bs(response.content, 'lxml')  # name the parser explicitly, as in the answer above
print(r.get_text())

I should be able to understand if I'm logged in or not, right? I keep seeing "Please log in to view or edit items on your watchlist." but I'm using the clean code given above, with my login and password.

Where is the mistake?
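For what it's worth, a check like this can be wrapped in a small helper. This is only a rough sketch, assuming the two phrases quoted in this thread are what a logged-out page contains; the helper name `looks_logged_out` and both sample texts are invented for illustration, and the real page wording may differ or change.

```python
# Phrases quoted in this thread as appearing on the logged-out page (assumption).
LOGGED_OUT_MARKERS = (
    "Please log in to view or edit items on your watchlist.",
    "Not logged in",
)


def looks_logged_out(page_text, markers=LOGGED_OUT_MARKERS):
    """Return True if any logged-out marker appears in the page text."""
    return any(marker in page_text for marker in markers)


# Made-up sample texts standing in for r.get_text().
sample_anonymous = "Watchlist Please log in to view or edit items on your watchlist."
sample_member = "Watchlist You have 3 pages on your watchlist."

print(looks_logged_out(sample_anonymous))
print(looks_logged_out(sample_member))
```

A text match like this only shows what the page says. It cannot tell you why the login failed.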

foebu
2

Wikipedia now forces HTTPS and requires additional parameters, and wpLoginAttempt became wploginattempt. Here is an updated version of K Z's initial answer:

import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',
    }

with requests.session() as s:
    resp = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)

    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
Antoine Dusséaux
0

You need to add the header Content-Type: application/x-www-form-urlencoded to your POST request.
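As a rough sketch of what setting that header explicitly could look like with the Python 3 standard library: the request below is only built and inspected, never sent. The credentials are placeholders, and whether this header is really what the original script lacks is not confirmed anywhere in this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "https://en.wikipedia.org/w/index.php?title=Special:UserLogin"

# Placeholder credentials; a real login also needs the token discussed above.
form = {"wpName": "my_username", "wpPassword": "my_password"}
body = urlencode(form).encode("ascii")

req = Request(
    LOGIN_URL,
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# Supplying data makes this a POST; urllib stores header names capitalized
# as "Content-type", so that is the key to look up.
print(req.get_method())
print(req.get_header("Content-type"))
```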

seriyPS
0

I also added the following lines and see myself as not logged in.

page = response.text

# str.find() returns -1 (which is truthy) when the text is absent,
# so test for membership with `in` instead.
if 'Not logged in' in page:
    print('You are not logged in.  :(')
else:
    print('YOU ARE LOGGED IN!  :)')
acrider