I am trying to scrape information from a website. So far, I've been able to access the webpage, log in with a username and password, and then print that landing page's page source into a separate .html/.txt file as needed.

Here's where the problems arise: on that "landing page," there's a table whose data I want to scrape. If I manually right-click any integer in that table and select "Inspect," I find the integer with no problem. However, when I look at the page source as a whole, I don't see the integers, just variable/parameter names. This leads me to believe the table is generated dynamically.

How can I scrape the data?

I've been to hell and back trying to scrape this website, and so far, here's how the available technology has worked for me:

  • Firefox, IE, and Opera do not render the table. My guess is that this is a problem on the website's end. Only Chrome seems to work if I log in manually.
  • Selenium's Chromium package has been failing on me repeatedly (on my Windows 7 laptop) and I have even posted a question about the matter here. For now I'll assume it's just a lost cause, but I'm willing to graciously accept anyone's benevolent help.
  • Spynner's description looked promising, but its setup has frustrated me for quite some time, and the lack of a clear introduction only makes it more cumbersome for a novice like myself.
  • I prefer to code in Python, as it is the language I am most comfortable with. I have a pending company request to have the company install Visual Studio on my computer (to try doing this in C#), but I'm not holding my breath...

If my code can be of any use, here's what I have so far, using Selenium with PhantomJS:

# Headless Browsing Using PhantomJS and Selenium
#
# PhantomJS is installed in current directory
#
from selenium import webdriver
import time

browser = webdriver.PhantomJS()
browser.set_window_size(1120, 550) # need a fake browser size to fetch elements

def login_entry(username, password):
    login_email = browser.find_element_by_id('UserName')
    login_email.send_keys(username)
    login_password = browser.find_element_by_id('Password')
    login_password.send_keys(password)
    submit_elem = browser.find_element_by_xpath("//button[contains(text(), 'Log in')]")
    submit_elem.click()

browser.get("https://www.example.com")
login_entry('usr_name', 'pwd')

time.sleep(10)

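# Write the rendered source itself; repr() would save an escaped Python string literal instead of HTML
with open('phantomjs_test_source_output.html', 'wb') as test_output:
    test_output.write(browser.page_source.encode('utf-8'))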

browser.quit()

P.S. If anyone thinks I should tag this question with javascript, let me know. I personally don't know JavaScript, but I'm sensing that it might be part of the problem/solution.

daOnlyBG

1 Answer

Try something like this. Sometimes with dynamic pages you need to wait for the data to load.

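  from selenium.webdriver.common.by import By
  from selenium.webdriver.support.wait import WebDriverWait
  from selenium.webdriver.support import expected_conditions as EC

  # my_expected_element is a locator tuple such as (By.TAG_NAME, 'td'); my_time is a timeout in seconds
  WebDriverWait(my_driver, my_time).until(EC.presence_of_all_elements_located(my_expected_element))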

http://selenium-python.readthedocs.io/waits.html
https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html
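
Once the wait returns, browser.page_source should contain the rendered table values rather than the variable/parameter names. As a rough sketch of the next step, the snippet below pulls cell text out of an HTML string using only the standard library. It assumes the data sits in a plain, non-nested table built from tr/td/th tags. `TableCellParser`, `extract_table_rows`, and the sample HTML are hypothetical names and data made up for illustration, so the real page's markup may need different handling.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper: assumes simple, non-nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table_rows(page_source):
    """Return a list of rows, each a list of stripped cell strings."""
    parser = TableCellParser()
    parser.feed(page_source)
    parser.close()
    return parser.rows


# Made-up stand-in for browser.page_source after the table has loaded
sample_source = (
    "<table><tr><th>Item</th><th>Count</th></tr>"
    "<tr><td>widgets</td><td> 42 </td></tr></table>"
)
rows = extract_table_rows(sample_source)
print(rows)  # [['Item', 'Count'], ['widgets', '42']]
```

In the question's script, this would mean calling extract_table_rows(browser.page_source) after the wait and before browser.quit(). If the saved source still shows only placeholder names, the wait condition probably is not matching the element that signals the data has arrived.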

Max Paymar