EDIT (10/30): solution found at the bottom of this post.
Hello everyone,
I'm new to the 'web-scraping' scene, and have been attempting to scrape data from pages at GISIS with Python. Though I originally attempted to do this with requests, D8Amonk's post on SO led me to mechanize, which has worked very well for the most part.
I was able to bypass the initial 403 Errors that I was receiving by adding the headers found on kumar's post, but now face the issue of being unable to get past the log-in screen for GISIS to its actual, relevant webpages.
Julian Todd's wonderful post at ScraperWiki has helped me immensely with understanding how to disable annoying submission controls and how to deal with the page's __doPostBack() mechanism. Unfortunately, the log-in page still ignores mechanize's attempts at completing its form submission: it doesn't recognize that an authority, username, and password have been entered.
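For context, here is a rough sketch of what the postback mechanism amounts to on the wire. ASP.NET's client-side __doPostBack(eventTarget, eventArgument) fills in the hidden __EVENTTARGET and __EVENTARGUMENT fields and then submits the form, so a script can imitate it by setting those two fields itself. The helper name build_postback_fields and the sample field values are hypothetical and only for illustration; nothing here has been run against GISIS.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in this post


def build_postback_fields(hidden_fields, event_target, event_argument=""):
    """Return a copy of the form fields with the two postback fields set."""
    fields = dict(hidden_fields)  # copy, so the caller's dict is not modified
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return fields


# Hypothetical hidden field value, for illustration only.
hidden = {"__VIEWSTATE": "abc123"}
fields = build_postback_fields(hidden, "lnkNext")

# Sorted only to make the encoded body deterministic for inspection.
body = urlencode(sorted(fields.items()))
print(body)
```

With mechanize, the equivalent step is assigning to those two controls on the selected form before calling submit(), which is what the snippet below does.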
My code snippets follow below:
import os
import sys
import webbrowser
import mechanize
import urllib2
import cookielib
from bs4 import BeautifulSoup
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}
request = urllib2.Request('https://gisis.imo.org/Public/SHIPS/Default.aspx', None, header)
...
jar = cookielib.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(jar)
browser.set_handle_robots(False)
browser.open(request)
browser.select_form(nr=0)
browser.form.set_all_readonly(False)
browser.form['ctl00$cpMain$ddlAuthorityType'] = ['PUBLIC']
browser.form['ctl00$cpMain$txtUsername'] = username
browser.form['ctl00$cpMain$txtPassword'] = password
browser.find_control('ctl00$cpMain$cbxRemember').selected = False
browser.find_control('ctl00$cpMain$btnRegister').disabled = True
browser["__EVENTTARGET"] = "lnkNext"
browser["__EVENTARGUMENT"] = ""
resp = browser.submit()
print '-- Request Made Successfully --'
return resp.read()
The result of resp.read() is then written to an .html file and opened in Firefox. Commenting and uncommenting the browser.form[...] lines has led to an interesting discovery: if the Authority (in this case, "Public") is included in the form submission, then the webpage will recognize the Authority, but complain that a username and password must be entered.
However, if the Authority line is commented out, then the produced webpage will recognize that the username and password have been entered, but will ask for the Authority to be selected. In this case, the username field will be filled out correctly, but the password field will be blank; I'm not sure if this is desirable or intended behavior. Similarly, as long as the Authority line is still commented out, I can comment out either the username or password line in my code and the resulting webpage will ask for the Authority and whichever other field was commented out (i.e. if I only submit the password, then the page will ask for an Authority and username).
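One way to narrow down this kind of behavior, sketched here under assumptions, is to compare the form fields a real browser posts with the fields the script posts and see which ones differ. The function name diff_post_bodies and both sample bodies are made up for illustration; a real body would be copied from the browser's network inspector.

```python
try:
    from urllib.parse import parse_qs  # Python 3
except ImportError:
    from urlparse import parse_qs  # Python 2, as used in this post


def diff_post_bodies(browser_body, script_body):
    """Return {field: (browser_value, script_value)} for fields that differ.

    Values are lists, as returned by parse_qs; None means the field is absent.
    """
    browser_fields = parse_qs(browser_body, keep_blank_values=True)
    script_fields = parse_qs(script_body, keep_blank_values=True)
    differences = {}
    for name in sorted(set(browser_fields) | set(script_fields)):
        browser_value = browser_fields.get(name)
        script_value = script_fields.get(name)
        if browser_value != script_value:
            differences[name] = (browser_value, script_value)
    return differences


# Made-up urlencoded bodies, for illustration only.
browser_body = "__VIEWSTATE=AAA&user=me&pw=secret"
script_body = "__VIEWSTATE=BBB&user=me"

differences = diff_post_bodies(browser_body, script_body)
for name, (from_browser, from_script) in sorted(differences.items()):
    print("%s: browser=%r script=%r" % (name, from_browser, from_script))
```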
Does anybody have any suggestions for what I may be doing wrong, or where else to look? This seems like a rather unusual issue -- searching on Google has failed to yield any similar issues that other individuals have experienced.
P.S. This is my first post on StackOverflow. I tried to attach images to explain the scenarios that I've described, but apparently lack the rep necessary to post images. I apologize profusely if I've been excessively verbose or done something wrong (e.g. in formatting my post). Please correct me!
EDIT (10/30): Came back to this project after moving on to other things and figured out a solution. Solution below:
This was actually not as complicated to fix as I would have thought it was. Modifying __EVENTTARGET and __EVENTARGUMENT was unnecessary. Instead, the __VIEWSTATE and __VIEWSTATEGENERATOR both needed to be modified. The correct values to use were found through the examination of successful POST requests being made in Firebug. Example code is as follows:
browser.form['__VIEWSTATE'] = 'blablabla'
browser.form['__VIEWSTATEGENERATOR'] = 'blablabla'
Modifying both values successfully logs me in to the main page. I hope this helps someone!
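For anyone who wants to compare the values seen in Firebug with what the log-in page itself serves, here is a minimal sketch that lists the hidden inputs in a page's HTML using only the standard library. It assumes the fields appear as ordinary `<input type="hidden">` elements; the class name HiddenFieldParser and the sample HTML are hypothetical, and this has not been tested against GISIS.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this post


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Made-up page fragment, for illustration only.
SAMPLE_HTML = """
<form>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4" />
<input type="hidden" name="__VIEWSTATEGENERATOR" value="CA0B0334" />
<input type="text" name="user" />
</form>
"""

parser = HiddenFieldParser()
parser.feed(SAMPLE_HTML)
hidden_fields = parser.fields
for name in sorted(hidden_fields):
    print("%s = %s" % (name, hidden_fields[name]))
```

In the script above, the HTML would come from the response of browser.open(...) rather than from a hard-coded string.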