I have been trying to use Selenium to scrape entire web pages. I expect at least a handful of them are SPAs (Angular, React, Vue), which is why I am using Selenium.
I need to download the entire page (if some lazy-loaded content is missing because I never scroll down, that is fine). I have tried adding a time.sleep() delay, but that has not worked. After I get the page I want to hash it, store the hash in a DB, and compare it later to check whether the content has changed. Currently the hash is different every time, and that is because Selenium is not downloading the entire page; each time a different partial amount is missing. I have confirmed this on several web pages, not just a single one.
I also have probably 1000+ web pages to go through, and I am collecting all the links by hand, so I do not have time to find a specific element on each page to confirm it has loaded.
How long the process takes is not important. If it takes an hour or more, so be it; accuracy matters, not speed.
If you have an alternative approach, please share it as well.
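For reference, one alternative I have been considering (just a sketch, not tested at scale): instead of a fixed sleep, keep polling the rendered source and only accept it once its hash stops changing between polls. The helper below is generic over any zero-argument `get_content` callable, so with Selenium I would pass `lambda: driver.page_source`; the names `wait_until_stable`, `poll_interval`, and `stable_polls` are my own.

```python
import hashlib
import time

def wait_until_stable(get_content, poll_interval=2.0, stable_polls=3, max_wait=120.0):
    """Poll get_content() until its SHA-512 digest is unchanged for
    `stable_polls` consecutive polls, or until `max_wait` seconds elapse.
    Returns the last content seen."""
    deadline = time.monotonic() + max_wait
    last_hash = None
    unchanged = 0
    content = None
    while time.monotonic() < deadline:
        content = get_content()
        current = hashlib.sha512(content.encode('utf-8')).hexdigest()
        if current == last_hash:
            unchanged += 1
            if unchanged >= stable_polls:
                break  # the page has stopped changing
        else:
            unchanged = 0
            last_hash = current
        time.sleep(poll_interval)
    return content
```

Usage would be something like `content = wait_until_stable(lambda: driver.page_source)` after `driver.get(url)`.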
My driver declaration:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

driverPath = '/usr/lib/chromium-browser/chromedriver'

def create_web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # set the window size (Chrome expects "width,height")
    options.add_argument('--window-size=1200,600')
    # try to initialize the driver
    try:
        driver = webdriver.Chrome(executable_path=driverPath, options=options)
    except WebDriverException:
        print("failed to start driver at path: " + driverPath)
        raise  # re-raise so the caller never receives an undefined driver
    return driver
My URL call (my timeout = 20):
import hashlib
import time

driver.get(url)
time.sleep(timeout)
content = driver.page_source
content = content.encode('utf-8')
hashed_content = hashlib.sha512(content).hexdigest()
^ I am getting a different hash here every time, even though the same URL should be producing the same web page.
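Another thing I probably need to rule out (an assumption on my part, not something I have confirmed): even a fully loaded page can hash differently on every request if it embeds per-request values such as CSRF tokens, nonces, or timestamps, typically inside script tags. A rough sketch of normalizing the source before hashing, stripping script blocks and collapsing whitespace; the name `normalize_and_hash` is mine:

```python
import hashlib
import re

def normalize_and_hash(page_source):
    """Strip <script> blocks (a common source of per-request tokens)
    and collapse whitespace, then return a SHA-512 hex digest."""
    # remove script elements, including their contents
    no_scripts = re.sub(r'<script\b[^>]*>.*?</script>', '',
                        page_source, flags=re.DOTALL | re.IGNORECASE)
    # collapse runs of whitespace so formatting noise does not alter the hash
    normalized = re.sub(r'\s+', ' ', no_scripts).strip()
    return hashlib.sha512(normalized.encode('utf-8')).hexdigest()
```

With this, two downloads that differ only in script contents or whitespace would hash the same, while a real content change would still produce a different digest.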