Selenium page_source truncated

Question

I would like to get the full source code of a page, for example: https://paris-sportifs.pmu.fr/event/699032

Below is my code snippet:

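import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options

event_url = "https://paris-sportifs.pmu.fr/event/699032"

opts = Options()
opts.add_argument("Host=[paris-sportifs.pmu.fr]")
capabilities = DesiredCapabilities.FIREFOX
capabilities["marionette"] = True
browser = webdriver.Firefox(options=opts, capabilities=capabilities)
browser.get(event_url)
time.sleep(3)
# Expand every collapsed section so that its content is added to the DOM
div_list = browser.find_elements_by_class_name('table--header--inner.collapsed')
for item in div_list:
    browser.execute_script("arguments[0].click();", item)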

The click function works properly, and if I choose "view source" inside the Selenium-driven browser, the source code is the one I expect. But if I do this:

html = browser.page_source
print(len(browser.page_source))

The printed length is 1472551 characters, whereas I expect more than 4000000.

I've tried waiting for a while (I even polled in a while True loop to see whether the result changes), but it had no effect at all.
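The polling check amounts to something like the sketch below. The helper name wait_for_stable_source and the timeout and interval values are made up for illustration, and a timeout stands in for the open-ended while True loop. It assumes only that the driver object exposes page_source.

```python
import time


def wait_for_stable_source(driver, timeout=30.0, interval=1.0):
    """Poll driver.page_source until its length stops changing.

    Returns the last observed length. The name and defaults are
    hypothetical; this is not part of Selenium itself.
    """
    deadline = time.monotonic() + timeout
    previous = len(driver.page_source)
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = len(driver.page_source)
        if current == previous:
            # Two consecutive reads agree: treat the source as settled.
            return current
        previous = current
    # Timed out while the length was still changing.
    return previous
```

Calling wait_for_stable_source(browser) after the click loop keeps returning the shorter length in my case.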

Any ideas?

Thanks

score 0 · Answer 1 · answered Nov 18 '20 at 16:40

From the documentation (https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/remote/RemoteWebDriver.html#getPageSource()):

Description copied from interface: WebDriver

"Get the source of the last loaded page. If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server. Think of it as an artist's impression."

So page_source may not give you the entire HTML of the current DOM as you expect. To get it, you may have to fetch it another way, such as the one mentioned in this comment: Python Selenium accessing HTML source
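One such alternative, shown here only as a minimal sketch, is to ask the browser to serialize its current DOM through JavaScript instead of relying on page_source. The helper names get_live_dom_html and compare_lengths are hypothetical, and driver is assumed to be an already-running Selenium WebDriver such as the browser object from the question.

```python
def get_live_dom_html(driver):
    """Return the browser's current DOM serialized as an HTML string."""
    # execute_script runs the JavaScript in the page and returns its result.
    # document.documentElement.outerHTML serializes the DOM as it is now,
    # including nodes added by scripts after the initial load.
    return driver.execute_script("return document.documentElement.outerHTML;")


def compare_lengths(driver):
    """Return (len(page_source), len(live DOM HTML)) for a quick comparison."""
    return len(driver.page_source), len(get_live_dom_html(driver))
```

With the code from the question, this would be called as get_live_dom_html(browser) after the click loop. Whether it returns the complete markup depends on the page: outerHTML does not include the documents loaded inside iframes, for instance.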

Hi, thank you for your answer. I've switched to PhantomJS, but it is deprecated, so it doesn't seem to be a good option here. — geoff111, Nov 19 '20 at 19:50
Yep, the page source still isn't refreshed, and retrieving the outerHTML via XPath isn't working either. — geoff111, Nov 19 '20 at 20:28
