I am trying to download thousands of HTML pages in order to parse them. I tried it with Selenium, but the downloaded file does not contain all the text seen in the browser.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager    

chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)

for url in URL_list:
        browser.get(url)
        content = browser.page_source
        with open(DOWNLOAD_PATH + file_name + ".html", "w", encoding='utf-8') as file:
            file.write(str(content))
browser.close()

But the HTML file I get doesn't contain all the content I see in the browser on the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.

URL example - https://www.camoni.co.il/411788/1Jacob

thank you

Ron Keinan

1 Answer

Be aware that running the webdriver in headless mode may not produce the same results. For a quick fix, I suggest scraping the page source without the --headless option.
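
For example, a minimal sketch based on your own setup code (with recent Chrome versions you could also try the newer "--headless=new" mode, which renders much closer to a regular window):

chrome_options = Options()
# no "--headless" argument here, so a visible browser window is used
# (alternatively, on newer Chrome: chrome_options.add_argument("--headless=new"))
browser = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)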

Another approach is to wait for certain elements to be located before grabbing the page source.

I suggest getting familiar with expected conditions and explicit waits for that.

Here's a function I prepared to illustrate the idea:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def awaitCertainElements_andGetSource(driver):
    # wait up to 5 seconds for the elements that are crucial for you to become visible
    wait = WebDriverWait(driver, 5)
    wait.until(EC.visibility_of_element_located((By.XPATH, "//*[text() = 'some text that is crucial for you']")))
    wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='some-id']")))

    return driver.page_source
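
Plugged into your download loop, it could look roughly like this (just a sketch; URL_list, DOWNLOAD_PATH and file_name are your own variables, and the XPath placeholders above need to be replaced with selectors that match real content on your pages):

for url in URL_list:
    browser.get(url)
    # only read the source after the crucial elements have become visible
    content = awaitCertainElements_andGetSource(browser)
    with open(DOWNLOAD_PATH + file_name + ".html", "w", encoding="utf-8") as file:
        file.write(content)
browser.quit()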
    
rafald