I have been trying to use Selenium to scrape entire web pages. I expect at least a handful of them are SPAs (Angular, React, Vue), which is why I am using Selenium.
I need to download the entire page (if some lazy-loaded content is missing because I never scroll down, that is fine). I have tried adding a time.sleep() delay, but that has not worked. After I get the page I want to hash it, store the hash in a DB, and compare it later to check whether the content has changed. Currently the hash is different every time, and that is because Selenium is not downloading the entire page; each time a different partial amount is missing. I have confirmed this on several web pages, not just a single one.
I also have probably 1000+ web pages to go through, and I am collecting all the links by hand, so I do not have time to find a specific element on each page to confirm it has loaded.
How long the process takes is not important. If it takes an hour or more, so be it; accuracy matters, not speed.
If you have an alternative approach, please share it as well.
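For reference, one alternative I have been considering (just a sketch, not tested at scale): instead of a fixed sleep, keep polling the rendered source and only accept it once its hash stops changing between polls. The helper below is generic over any zero-argument `get_content` callable, so with Selenium I would pass `lambda: driver.page_source`; the names `wait_until_stable`, `poll_interval`, and `stable_polls` are my own.

```python
import hashlib
import time

def wait_until_stable(get_content, poll_interval=2.0, stable_polls=3, max_wait=120.0):
    """Poll get_content() until its SHA-512 digest is unchanged for
    `stable_polls` consecutive polls, or until `max_wait` seconds elapse.
    Returns the last content seen."""
    deadline = time.monotonic() + max_wait
    last_hash = None
    unchanged = 0
    content = None
    while time.monotonic() < deadline:
        content = get_content()
        current = hashlib.sha512(content.encode('utf-8')).hexdigest()
        if current == last_hash:
            unchanged += 1
            if unchanged >= stable_polls:
                break  # the page has stopped changing
        else:
            unchanged = 0
            last_hash = current
        time.sleep(poll_interval)
    return content
```

Usage would be something like `content = wait_until_stable(lambda: driver.page_source)` after `driver.get(url)`.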
My driver declaration:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

driverPath = '/usr/lib/chromium-browser/chromedriver'

def create_web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # set the window size (Chrome expects "width,height")
    options.add_argument('--window-size=1200,600')
    # try to initialize the driver
    try:
        driver = webdriver.Chrome(executable_path=driverPath, options=options)
    except WebDriverException:
        print("failed to start driver at path: " + driverPath)
        raise  # re-raise so the caller never receives an undefined driver
    return driver
My URL call (my timeout = 20):
import hashlib
import time

driver.get(url)
time.sleep(timeout)
content = driver.page_source
content = content.encode('utf-8')
hashed_content = hashlib.sha512(content).hexdigest()
^ I am getting a different hash here every time, even though the same URL should be producing the same web page.
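Another thing I probably need to rule out (an assumption on my part, not something I have confirmed): even a fully loaded page can hash differently on every request if it embeds per-request values such as CSRF tokens, nonces, or timestamps, typically inside script tags. A rough sketch of normalizing the source before hashing, stripping script blocks and collapsing whitespace; the name `normalize_and_hash` is mine:

```python
import hashlib
import re

def normalize_and_hash(page_source):
    """Strip <script> blocks (a common source of per-request tokens)
    and collapse whitespace, then return a SHA-512 hex digest."""
    # remove script elements, including their contents
    no_scripts = re.sub(r'<script\b[^>]*>.*?</script>', '',
                        page_source, flags=re.DOTALL | re.IGNORECASE)
    # collapse runs of whitespace so formatting noise does not alter the hash
    normalized = re.sub(r'\s+', ' ', no_scripts).strip()
    return hashlib.sha512(normalized.encode('utf-8')).hexdigest()
```

With this, two downloads that differ only in script contents or whitespace would hash the same, while a real content change would still produce a different digest.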