7

I'm trying to crawl the website "http://everydayhealth.com". However, I found that the page is rendered dynamically: when I click the "More" button, additional news items are shown. Clicking the button with splinter does not make "browser.html" update to the current HTML content. Is there a way to get the newest HTML source, using either splinter or selenium? My code in splinter is as follows:

import requests
from bs4 import BeautifulSoup
from splinter import Browser

browser = Browser()
browser.visit('http://everydayhealth.com')
browser.click_link_by_text("More")

print(browser.html)

Based on @Louis's answer, I rewrote the program as follows:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://www.everydayhealth.com")
more_xpath = '//a[@class="btn-more"]'
more_btn = WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath(more_xpath))
more_btn.click()
more_news_xpath = '(//a[@href="http://www.everydayhealth.com/recipe-rehab/5-herbs-and-spices-to-intensify-flavor.aspx"])[2]'
WebDriverWait(driver, 5).until(lambda driver: driver.find_element_by_xpath(more_news_xpath))

print(driver.execute_script("return document.documentElement.outerHTML;"))
driver.quit()

However, I still can't find the text from the updated page in the output. For example, searching the output for "Is Milk Your Friend or Foe?" returns nothing. What's the problem?

xjmfel
  • How do you check that there are no changes in the HTML? For instance, I see `5 Herbs and Spices That Boost Your Health` text inside the printed html and that is loaded after the click on `More` button. – alecxe Nov 07 '14 at 21:06
  • @alecxe Thanks for your reply. I think I check it the same way you do. You found "5 Herbs and Spices That Boost Your Health" in the printed HTML because that article happened to be shown in the thumbnail at the very top of the webpage. If you check any other title that appears after clicking the button, for instance "Is Milk Your Friend or Foe?", you won't find it. – xjmfel Nov 08 '14 at 05:24
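
One way to make this check mechanical rather than searching the output by eye is to collect the link texts from the HTML captured before and after the click and compare the two sets. The following is a minimal sketch using only the standard library. `LinkTextCollector`, `link_texts` and `new_link_texts` are hypothetical helper names, and the sketch assumes that article titles appear as the text of `<a>` elements, which may not hold for every title on this site.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside an <a> element
        self._parts = []   # text fragments of the link being read
        self.texts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and normalize whitespace.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.texts.add(text)
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def link_texts(html):
    """Return the set of link texts found in an HTML string."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def new_link_texts(before_html, after_html):
    """Return link texts present after the click but not before it."""
    return link_texts(after_html) - link_texts(before_html)
```

Capturing the HTML once before the click and once after it, then calling `new_link_texts(before, after)`, shows which titles were actually added. An empty result suggests the second capture was taken before the page finished updating.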

2 Answers

3

With Selenium, assuming that driver is your initialized WebDriver object, this will give you the HTML that corresponds to the state of the DOM at the time you make the call:

driver.execute_script("return document.documentElement.outerHTML;")

The return value is a string so you could do:

print(driver.execute_script("return document.documentElement.outerHTML;"))
Louis
  • Thanks for your reply. Could you please take a look at my updated question? I followed your instructions, but the output text still doesn't contain the newly generated HTML. – xjmfel Nov 09 '14 at 00:52
  • The problem you have is that you are getting the HTML before the page has finished updating. A very easy way to know that you have a timing issue is to use `time.sleep(...)` and put an arbitrary number of seconds that you know is big enough for the update to occur. If it works with the sleep then you know you have a timing issue. You are probably not waiting for the right thing. It looks like the more news button is put back into the page before the articles are added. This is a significantly different problem than just getting the dynamic HTML. So I would suggest... – Louis Nov 09 '14 at 00:59
  • ... leaving this question as it originally was, studying the web page you are working with to see what it is you should actually be waiting for, perhaps reading some SO questions on waiting in Selenium, and then posting a new question about waiting specifically if you still need help. – Louis Nov 09 '14 at 01:00
  • I should have mentioned in my first comment that I did download your code and tried it here and it is definitely a timing issue. It worked when I added `import time; time.sleep(5)` just before the `print`. – Louis Nov 09 '14 at 01:01
  • Wow, it works after inserting the statement "time.sleep(5)". Thanks for this tip! In addition, is there a more intelligent way of waiting for the page to load fully rather than waiting for a fixed time? I actually added the statement "WebDriverWait(driver, 5).until(lambda driver: driver.find_element_by_xpath(more_news_xpath))" in order to check whether new elements are shown in the updated page, but it doesn't seem to work as expected. Thanks. – xjmfel Nov 09 '14 at 03:25
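
As a sketch of waiting for the right thing instead of sleeping for a fixed time: poll until a title that only appears after the click shows up in the page source. `wait_for_text_in_source` is a hypothetical helper, not part of Selenium. It assumes that `driver.page_source` reflects the updated page (as another answer here reports) and that you know a title to wait for, such as "Is Milk Your Friend or Foe?". In Selenium itself, `WebDriverWait(driver, 10).until(lambda d: text in d.page_source)` expresses the same idea.

```python
import time


def wait_for_text_in_source(driver, text, timeout=10.0, poll=0.5):
    """Poll driver.page_source until it contains `text`, then return the HTML.

    Raises TimeoutError if the text has not appeared within `timeout` seconds.
    `driver` only needs a `page_source` attribute, as a Selenium WebDriver has.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # ask the driver for the source on every poll
        if text in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{text!r} not found after {timeout} seconds")
        time.sleep(poll)
```

Called right after `more_btn.click()`, this returns as soon as the expected title is present, so the script does not pay the full five seconds when the update is fast and does not print stale HTML when it is slow.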
-1

When I use Selenium for tasks like this, `browser.page_source` (where `browser` is the WebDriver object) does get updated after the page changes.

myersjustinc