I am trying to scrape information from a website. So far, I've been able to access the webpage, log in with a username and password, and then print that landing page's page source into a separate .html/.txt file as needed.
Here's where the problems arise: on that "landing page," there's a table that I want to scrape the data from. If I were to manually right-click on any integer on that table, and select "inspect," I'd find the integer with no problem. However, when looking at the page source as a whole, I don't see the integers- just variable/parameter names. This leads me to believe it is a dynamic website.
How can I scrape the data?
I've been to hell and back trying to scrape this website, and so far, here's how the available technology has worked for me:
- Firefox, IE, and Opera do not render the table. My guess is that this is a problem on the website's end. Only Chrome seems to work if I log in manually.
- Selenium's
Chromium
package has been failing on me repeatedly (on my Windows 7 laptop) and I have even posted a question about the matter here. For now I'll assume it's just a lost cause, but I'm willing to graciously accept anyone's benevolent help. Spynner
's description looked promising, but that setup has frustrated me for quite some time- and the lack of a clear introduction only compounds its cumbersome nature to a novice like myself.- I prefer to code in Python, as it is the language I am most comfortable with. I have a pending company request to have the company install Visual Studio on my computer (to try doing this in C#), but I'm not holding my breath...
If my code can be of any use, so far, here's how I'm using mechanize
:
# Headless Browsing Using PhantomJS and Selenium
#
# PhantomJS is installed in current directory
#
from selenium import webdriver
import time
browser = webdriver.PhantomJS()
browser.set_window_size(1120, 550) # need a fake browser size to fetch elements
def login_entry(username, password):
login_email = browser.find_element_by_id('UserName')
login_email.send_keys(username)
login_password = browser.find_element_by_id('Password')
login_password.send_keys(password)
submit_elem = browser.find_element_by_xpath("//button[contains(text(), 'Log in')]")
submit_elem.click()
browser.get("https://www.example.com")
login_entry('usr_name', 'pwd')
time.sleep(10)
test_output = open('phantomjs_test_source_output.html', 'w')
test_output.write(repr(browser.page_source))
test_output.close()
browser.quit()
p.s.- if anyone thinks I should be tagging javascript
to this question, let me know. I personally don't know javascript but I'm sensing that it might be part of the problem/solution.