18

i unfortunately am not able to post code to reproduce this problem, since it involves signing into a site that is not public. but my question is more general than a code problem. essentially, `driver.page_source` does not match what shows up in the browser it is driving. this is not an issue with elements not loading fully, because i am testing this while executing code line by line in a python terminal. i am looking at the page source in the browser after right clicking and going to "view page source", but if i print `driver.page_source` or attempt to `find_element_by_[...]`, it shows slightly different code with entire elements missing. here is the html in question:

<nav role="navigation" class="utility-nav__wrapper--right">
<input id="hdn_partyId" value="1965629" type="hidden">
<input id="hdn_firstName" value="CHARLES" type="hidden">
<input id="hdn_sessionId" value="uHxQhlARvzA7N16uh+KJAdNFIcY6D8f9ornqoPQ" type="hidden">
<input id="hdn_cmsAlertRequest" type="hidden" value="Biennial Plus">
<ul class="h-list h-list--middle">
    [...]
</ul>

i need all 4 of the input elements, however, hdn_partyId and hdn_sessionId elements do not appear in selenium's .page_source and if i try to get them with .find_element_by_[...] i get a NoSuchElementException

i even ran a check on finding all input elements and listing them, and these 2 do not show up.

does anyone have any idea why selenium would not provide the same content as directly looking at the browser it is driving?

EDIT: to clarify... i am driving Chrome with Chromedriver through Selenium. this is not an issue with the page not fully loading. as i mentioned, i am running this manually line by line through a python terminal and not executing a script. so the browser pops up, loads the page, logs in, and then i manually check the browser's page source and see the element, then i print driver.page_source and it's not there, and if i run session_id = driver.find_element_by_id('hdn_sessionId') i get a NoSuchElementException. there are also no frames at all in the page, nor any additional windows.

crookedleaf
  • Did you try to use an explicit wait function before trying to get the source? – karthik006 Jul 21 '17 at 20:35
  • did you try using Chromedriver and checking whether the page loads completely? and checking its source code? – 0xMH Jul 21 '17 at 20:39
  • I don't believe `driver.page_source` will retrieve any DOM elements generated by JavaScript. – Tod Jul 21 '17 at 20:41
  • i did not try to use an explicit wait, since this isn't an issue with the page loading completely. this is something where i am executing the code in a python terminal. i am using selenium to drive Chrome. so the page is fully loaded, i look at it, right click, view page source, see the elements, then `print driver.page_source` and these elements are not there. – crookedleaf Jul 21 '17 at 20:44
  • Do you mean that you tried `find_element_...()` from Python console after page is completely rendered and was able to find only 2 of 4 inputs? P.S. Ignore `driver.page_source` – Andersson Jul 21 '17 at 20:44
  • @Andersson correct, but it's visible if i right click the browser's page and click "view page source" – crookedleaf Jul 21 '17 at 20:49
  • @crookedleaf, and all those elements located just like on provided piece of `HTML` and have the same parent? If you `print(len(driver.find_elements_by_xpath('//input[starts-with(@id, "hdn_")]')))` you also get `2` as output? – Andersson Jul 21 '17 at 20:58
  • @Andersson yes, that code was copy-pasted straight from the browser that selenium is driving. and the `len` does return `2`. – crookedleaf Jul 21 '17 at 21:04
  • Calling `JavaScript` directly sometimes solves hopeless issues like that (at least you can try). Check `driver.execute_script('return document.getElementById("hdn_sessionId");').get_attribute("value")`... – Andersson Jul 21 '17 at 21:14
  • Try to identify the root of the issue from the browser's console by calling `$0.innerHTML` on a selected element. The only case I know of where a node doesn't appear in the page source is when the element is a frame or a shadow DOM. You should also try with Firefox to see if the issue is related to Chrome. – Florent B. Jul 21 '17 at 21:29
  • @Andersson i did try the javascript method as well, as i saw it mentioned in another post, but `return document.getElementById("hdn_sessionId")` itself returns `None`, and running the exact code raises `AttributeError: 'NoneType' object has no attribute 'get_attribute'`. it's crazy, i've been working heavily with selenium for 7 years now and have never run into an issue like this. – crookedleaf Jul 21 '17 at 21:43
  • I've had a problem like this when trying to click Google's ReCaptcha box. Javascript only worked some of the time for me. It was weird. – whackamadoodle3000 Jul 21 '17 at 21:47
  • @FlorentB. i tried Firefox and am having the same issue – crookedleaf Jul 21 '17 at 21:56

4 Answers

15

A coworker of mine figured out the issue and a workaround. Essentially, after the page is done loading, it runs JavaScript that cleans up the DOM. What "view page source" in the browser shows is not the current state of the page. Running `print driver.page_source` or using any form of `driver.find_element_by_[...]` pulls from the current, live DOM, while the browser's "view page source" only shows what was provided when the page first loaded. If you start 'inspecting' the page in Chrome, you will see that the HTML is different from what the browser says the "page source" is. After reverse engineering the JavaScript, we were able to run `partyid = driver.execute_script('return accountdata.$partyId.val();')` and get the value that was originally assigned. I hope this is enough info to help other people who may run into this issue in the future.
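To confirm that a script really is stripping elements, it can help to compare the HTML the server sent with the live DOM. Below is a minimal sketch using only the standard library. It assumes you have both versions as strings, however you obtained them; the helper names (`InputIdCollector`, `input_ids`, `removed_input_ids`) and the sample strings are hypothetical and only modelled on the snippet in the question.

```python
from html.parser import HTMLParser


class InputIdCollector(HTMLParser):
    """Collect the id attribute of every <input> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "input":
            element_id = dict(attrs).get("id")
            if element_id:
                self.ids.add(element_id)


def input_ids(html):
    """Return the set of <input> ids found in `html`."""
    collector = InputIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


def removed_input_ids(server_html, live_html):
    """Ids present in the server response but absent from the live DOM."""
    return sorted(input_ids(server_html) - input_ids(live_html))


# Hypothetical sample data: what "view page source" would show ...
server_html = """
<nav role="navigation" class="utility-nav__wrapper--right">
<input id="hdn_partyId" value="1965629" type="hidden">
<input id="hdn_firstName" value="CHARLES" type="hidden">
<input id="hdn_sessionId" value="example" type="hidden">
<input id="hdn_cmsAlertRequest" type="hidden" value="Biennial Plus">
</nav>
"""

# ... and what driver.page_source would return after the cleanup script ran.
live_html = """
<nav role="navigation" class="utility-nav__wrapper--right">
<input id="hdn_firstName" value="CHARLES" type="hidden">
<input id="hdn_cmsAlertRequest" type="hidden" value="Biennial Plus">
</nav>
"""

print(removed_input_ids(server_html, live_html))
# → ['hdn_partyId', 'hdn_sessionId']
```

In a real session, `live_html` would be `driver.page_source`, and `server_html` would be whatever the server originally returned for the same page.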

crookedleaf
  • 2
    The "view page source" from the context menu displays the HTML returned by the server while the command `driver.page_source` returns the actual HTML built by the browser. I guess we all assumed that you were talking about the source displayed in the "Element" tab from Developer Tools ("Inspect" from the context menu). It's not really an issue, you were just looking at the wrong place. So in the end, the HTML returned by `driver.page_source` **does match** what shows up in the browser it is driving. – Florent B. Jul 22 '17 at 01:58
  • i guess so, but i didn't understand at the time that the browser didn't show the HTML that was actually being displayed when you right click and click "view page source". but i did say quite a few times that i was right clicking in the browser and clicking "view page source", and never mentioned anything about "inspect element", so i'm a bit surprised people assumed something else when i said "right clicking in the browser and clicking view source". but today i learned... lol – crookedleaf Jul 22 '17 at 21:06
5

Try it like this: prefixing the URL with "view-source:" makes the browser load the original source code of the page. The prefix can differ between browsers; this one is for Chrome.

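from selenium.webdriver.common.by import By

driver.get("view-source:" + url)  # url: address of the page whose raw source you want
# find_element_by_tag_name was removed in Selenium 4.3; use find_element(By...) there
sourcecode = driver.find_element(By.TAG_NAME, 'body').text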
Nensi Kasundra
yash shah
  • This was the only solution which worked for me when trying to get the content from a – Tobias P. G. Oct 10 '22 at 10:51
0

If you locate the 'body' element of the page and then call get_attribute('innerHTML') on it, you get the full HTML of the page body as it currently stands in the DOM.

Ger Mc
-1

Quite often with Selenium, simply waiting a few seconds for the full DOM to load does the trick without needing a lot of extra code. In the example below, the HTML that was gathered reflects what you see when you 'inspect' the page, as opposed to 'view source', which shows the HTML as it was before any JavaScript ran.

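from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = "https://example.com"  # placeholder; substitute the page you are loading
driver.get(url)
sleep(10)  # crude fixed wait so that post-load JavaScript can finish
HTML = driver.page_source  # serialized live DOM, not the original server response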
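
As an alternative to the fixed `sleep(10)`, one could poll until the source stops changing. This is only a rough sketch: `wait_for_stable_source` is a hypothetical helper, not part of Selenium, and it assumes nothing about the driver beyond a `page_source` attribute. Two identical reads in a row do not prove that every script has finished, so treat it as a heuristic.

```python
import time


def wait_for_stable_source(driver, timeout=10.0, poll=0.5):
    """Return driver.page_source once two consecutive reads are identical.

    Raises TimeoutError if the source is still changing after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    previous = driver.page_source
    while time.monotonic() < deadline:
        time.sleep(poll)
        current = driver.page_source
        if current == previous:
            # No change between two polls: assume the DOM has settled
            return current
        previous = current
    raise TimeoutError("page source did not stabilise within %.1f s" % timeout)
```

With a real driver this would be called as `HTML = wait_for_stable_source(driver)` right after `driver.get(url)`.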
  • i know this question is quite a few years old, but as the post had mentioned, waiting wasn't the issue. i was manually debugging the code, and executing the code line by line in a python terminal. the problem was a javascript function that ran after the page load was completed, which cleared out a bunch of elements from the DOM. the accepted answer breaks down what caused the issue and what the workaround was. – crookedleaf Nov 11 '21 at 17:59