How to scrape a web page that requires login with Python?

Question

I am quite new to web scraping and would appreciate some help with my problem. I have found several articles on the topic, but I can't get any of them working. The closest tutorial I've followed is this one: "How to scrape a website that requires login first with Python".

I am trying to scrape the following site: http://amigobulls.com/stocks/GE/income-statement/quarterly

My goal is to scrape the download link labelled "download General Electric financial statements". That link is only available after logging in, and I can't get the login step working.

import cookielib

import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
br.open('http://amigobulls.com/stocks/GE/income-statement/quarterly')

for f in br.forms():
    print f
br.select_form(nr=0)
# User credentials
br.form['pass'] = '______'
br.select_form(nr=1)
br.form['name'] = '______'
for f in br.forms():
    print f
# Login
br.submit()

print br.open('http://amigobulls.com/stocks/GE/income-statement/quarterly').read()

The output I got is as follows:

<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
  <TextControl(<None>=)>
  <PasswordControl(pass=)>
  <CheckboxControl(remember_me=[*1])>
  <SubmitControl(<None>=Login) (readonly)>
  <TextControl(<None>=)>>
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
  <TextControl(name=)>
  <PasswordControl(<None>=)>
  <PasswordControl(<None>=)>
  <SubmitControl(<None>=Join Us) (readonly)>>
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
  <TextControl(<None>=)>
  <PasswordControl(pass=______)>
  <CheckboxControl(remember_me=[*1])>
  <SubmitControl(<None>=Login) (readonly)>
  <TextControl(<None>=)>>
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
  <TextControl(name=______)>
  <PasswordControl(<None>=)>
  <PasswordControl(<None>=)>
  <SubmitControl(<None>=Join Us) (readonly)>>

This is followed by the HTML of the page as it appears when not logged in.
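A side note on the printed forms: the last form selected before `br.submit()` is `nr=1`, the "Join Us" form, so that is the form that gets submitted, not the "Login" form. The text fields in the login form also have no name, and a plain HTML form submission does not send unnamed controls, so the site may be handling login through JavaScript rather than through this form. The Python 3 sketch below shows one way to pick the login form by its controls instead of by index. The `Control` and `Form` tuples are hypothetical stand-ins that mirror the printed output; with a real mechanize browser the same predicate could be passed as `br.select_form(predicate=is_login_form)`, assuming the controls expose `.type` and `.name` as mechanize controls do.

```python
from collections import namedtuple

# Hypothetical stand-ins for mechanize forms and controls,
# built from the printed output in the question.
Control = namedtuple("Control", "type name")
Form = namedtuple("Form", "controls")

login_form = Form([
    Control("text", None),
    Control("password", "pass"),
    Control("checkbox", "remember_me"),
    Control("submit", None),
    Control("text", None),
])
signup_form = Form([
    Control("text", "name"),
    Control("password", None),
    Control("password", None),
    Control("submit", None),
])


def is_login_form(form):
    """Return True for a form with exactly one password field and a 'remember_me' control."""
    types = [c.type for c in form.controls]
    names = [c.name for c in form.controls]
    return types.count("password") == 1 and "remember_me" in names


def unnamed_inputs(form):
    """Return the types of text/password controls that have no name attribute."""
    return [c.type for c in form.controls
            if c.name is None and c.type in ("text", "password")]


for label, form in (("login", login_form), ("signup", signup_form)):
    print(label, is_login_form(form), unnamed_inputs(form))
```

If the unnamed fields really are the username inputs, filling them through mechanize would not change what is sent to the server, which would be worth checking in the browser's network tab before going further.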

If the login succeeded, I should be able to find the download link in the returned HTML.
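For reference, this is roughly how the link would be located once the logged-in HTML is available. It is a minimal Python 3 sketch using only the standard library; the `sample` markup and the `/files/GE.xls` path are made up for illustration, since the real page structure after login is unknown.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_link(html, phrase):
    """Return the href of the first link whose text contains phrase (case-insensitive)."""
    parser = LinkFinder()
    parser.feed(html)
    parser.close()
    for href, text in parser.links:
        if phrase.lower() in text.lower():
            return href
    return None


# Hypothetical markup; the real page may differ.
sample = (
    '<p><a href="/about">About</a> '
    '<a href="/files/GE.xls">Download General Electric financial statements</a></p>'
)
print(find_link(sample, "download General Electric financial statements"))
```

In practice the `html` argument would be the string returned by `.read()` on the logged-in response.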

Can anyone help? Thank you so much!

Another library that could achieve this functionality (with much simpler code) is `selenium`, just so you know. — n1c9, Apr 08 '16 at 17:29
Hi, thanks for your reply. I have tried Selenium, but I got stuck on the webdriver setup and don't know how to get around it. — DLeung, Apr 11 '16 at 01:22
When I set Selenium up, it says: `WebDriverException: Message: 'wires' executable needs to be in PATH.` In addition, Selenium seems to require me to download the webdrivers from different places, and I would like to make my program a bit more plug-and-play friendly. I am sorry if this seems ridiculous to you. — DLeung, Apr 11 '16 at 05:32
