I'm extracting HTML from a web page using Mechanize, but the HTML it returns differs from what I see in the browser. It looks as if Mechanize is replacing the content of certain elements with '(n/a)'.

Example (structure shown in Firebug)

<tr>
    <td>
        <img class="bullet" src="images/bulletorange.gif" alt="">
        <span class="detailCaption">Video Format Mode:</span>
        <span class="settingValue" id="vidSdSdiAnlgFormatSelectionMode.1.1">Auto</span>
    </td>
</tr>

Example (structure output by Mechanize)

<tr>
    <td>
        <img class='bullet' src='images/bulletorange.gif' alt='' />
        <span class='detailCaption'>Video Format Mode:</span>
        <span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
    </td>
</tr>

The problem is that "Auto" is being replaced by "(n/a)". I'm not really sure why!

Why is Mechanize doing this, and how can I fix it?

Below is my code:

import ssl

import mechanize


def login_and_return_html(self, url_login, url_after_login, form_username, form_password, username, password):
    """
    Description: Returns HTML code from a website that requires login.

    Input Arguments: url_login (str) - The URL where you enter the login username and password
                     url_after_login (str) - The URL where you want to go after you log in
                     form_username (str) - The name of the form's username input field
                     form_password (str) - The name of the form's password input field
                     username (str) - The actual username
                     password (str) - The actual password

    Return or Output: Returns HTML code of the 'url_after_login' page

    Modules and Classes: mechanize
                         ssl
    """
    try:  # Disable SSL certificate validation
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:  # Legacy Python that doesn't verify HTTPS certificates by default
        pass
    else:  # Handle target environment that doesn't support HTTPS verification
        ssl._create_default_https_context = _create_unverified_https_context

    br = mechanize.Browser()  # Browser

    br.set_handle_equiv(True)  # Browser options
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    cj = mechanize.CookieJar()  # Cookie Jar
    br.set_cookiejar(cj)

    # Follow a refresh of 0 seconds, but do not hang on a refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(),
                          max_time=1)

    br.open(url_login)  # Login
    br.select_form(nr=0)
    try:
        br.form[form_username] = username  # Fill in the username text field
        br.form[form_password] = password  # Fill in the password text field
        br.submit()
    except Exception:  # Fall back to treating the username field as a drop-down menu
        control = br.form.find_control(form_username)
        for item in control.items:  # Select the drop-down entry matching the username
            if item.name == username:
                item.selected = True
        br.form[form_password] = password  # Fill in the password text field
        br.submit()

    html = br.open(url_after_login).read()
    return html
Sebastian Zartner
2 Answers

Why is mechanize doing this?

Mechanize probably isn't doing this; the browser is. My guess is that the site uses JavaScript, which Mechanize does not support, so you get the HTML in its original form, i.e. the content before any JavaScript was executed. The "(n/a)" is then the placeholder in the page source, which a script later replaces with "Auto" in the browser.
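One way to check this guess, sketched here with only the Python 3 standard library: parse the HTML string that Mechanize returned and read the text of the span in question. If it already says "(n/a)" there, the placeholder is in the page source and the real value is filled in later by a script. The names `SpanTextExtractor` and `span_text_by_id` are made up for illustration, and the sample string is the span from the question. Note that under Python 3 the result of `read()` is bytes and would need decoding first.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collects the text inside the first <span> that has the given id."""

    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self._inside = False
        self.text = None  # Stays None if no matching span is found

    def handle_starttag(self, tag, attrs):
        # Only the first matching span is recorded
        if tag == "span" and self.text is None and dict(attrs).get("id") == self.span_id:
            self._inside = True
            self.text = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


def span_text_by_id(html, span_id):
    """Return the text of the span with this id, or None if it is absent."""
    parser = SpanTextExtractor(span_id)
    parser.feed(html)
    parser.close()
    return parser.text


# The span exactly as Mechanize returned it in the question
raw_html = """<span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>"""
print(span_text_by_id(raw_html, "vidSdSdiAnlgFormatSelectionMode.1.1"))  # prints (n/a)
```

Viewing the page source in the browser (rather than the live DOM that Firebug shows) should give the same answer without any code.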

And how can I fix it?

You can't fix it with Mechanize; you need a solution that supports JavaScript. See Mechanize and Javascript for more information and possible solutions.

Steffen Ullrich

Here is how I was able to obtain the HTML including the content generated by JavaScript.

I used the selenium library.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

#Using Firefox 48.0.2 and the new WebDriver
caps = DesiredCapabilities.FIREFOX
caps["marionette"] = True
br = webdriver.Firefox(capabilities=caps)
br.get('http://XXX.XXX.XXX.XXX/')

#Input Username and Password
username = br.find_element_by_name('SOME_NAME')
username.send_keys('USERNAME')
password = br.find_element_by_name('SOME_NAME')
password.send_keys('PASSWORD')
form = br.find_element_by_name('submitButton')
form.submit()
time.sleep(20)

# This is the part that differs: read the DOM after the JavaScript has run
html_element = br.find_element_by_xpath('/html')
html = br.execute_script("return arguments[0].innerHTML;", html_element)
print(html)
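The fixed `time.sleep(20)` above always waits the full 20 seconds, and still fails if the page takes longer. A possible alternative is to poll until the value has actually been filled in. The sketch below is a generic helper written for Python 3 with only the standard library; the name `wait_until` and its parameters are assumptions for illustration, not part of Selenium.

```python
import time


def wait_until(predicate, timeout=20.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Hypothetical usage with the `br` driver above, in place of time.sleep(20),
# assuming the span from the question is present on the page:
# wait_until(lambda: br.find_element_by_id(
#     "vidSdSdiAnlgFormatSelectionMode.1.1").text != "(n/a)")
```

Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for this purpose, which is probably the better choice in real code.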