14

What’s the difference between getting text and innerHTML when using Selenium?

Even though we have text under a particular element, when we perform .text we get empty values. But doing .get_attribute("innerHTML") works fine.

What is the difference between the two? When should someone use .get_attribute("innerHTML") over .text?

Peter Mortensen
Vivek Srinivasan
  • "innerHTML" will return the inner HTML of this element, which contains all HTML tags inside it, including text and tags, like "<p>This is demo</p>", while .text will only retrieve the text content of its descendants without any HTML tags, for example: "This is demo" – thebadguy Nov 04 '16 at 06:01
  • I can get that point....but at times....when u do the following `driver.find_element_by_css_selector("p").text` will yield nothing. but doing driver.find_element_by_css_selector("p").get_attribute("innerHTML") will result in extracting `This is demo`....why is that behavior? – Vivek Srinivasan Nov 04 '16 at 06:13
  • the problem can be with your selector...when you are using driver.find_element_by_css_selector("p").text....If you can share the url of webpage you are trying.. I can explain thing in better way – thebadguy Nov 04 '16 at 06:16
  • `"http://www.costco.com/Weatherproof%C2%AE-Men's-Ultra-Tech-Jacket.product.100106552.html"` Tried getting product title using the following line `driver.find_element_by_css_selector("h1[itemprop='name']").text` yielded nothing....but driver.find_element_by_css_selector("h1[itemprop='name']").get_attribute("innerHTML") gets me the product title `"Weatherproof\xae Men's Ultra Tech Jacket"` – Vivek Srinivasan Nov 04 '16 at 06:37
  • i updated the comment with details...by mistake I pressed enter before adding further details...My Bad – Vivek Srinivasan Nov 04 '16 at 06:40
  • I have provided an answer: the "h1[itemprop='name']" selector on Chrome or Firefox returns 2 matching nodes, while .product-h1-container.visible-xl-block>h1 returns only one matching node; that's why it is printing what is expected – thebadguy Nov 04 '16 at 06:54

5 Answers

11

To start with, text is a Python property of the WebElement object, whereas innerHTML is a DOM property of the element that is read through get_attribute(). Fundamentally there are some differences between a property and an attribute.


get_attribute("innerHTML")

get_attribute("innerHTML") gets the innerHTML of the element.

This method will first try to return the value of a property with the given name. If a property with that name doesn’t exist, it returns the value of the attribute with the same name. If there’s no attribute with that name, None is returned.

Values which are considered truthy, that is, equal to "true" or "false", are returned as booleans. All other non-None values are returned as strings. For attributes or properties which do not exist, None is returned.

  • Arguments:

    name - Name of the attribute/property to retrieve (here, "innerHTML").
    
  • Example:

    # Extract the inner HTML (markup included) of an element.
    my_text = target_element.get_attribute("innerHTML")
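The lookup order described above (property first, then attribute, then None) can be sketched as a toy model. This is not Selenium's actual implementation; the function name and the two dictionaries are made up for illustration, and the boolean conversion is left out on purpose.

```python
def lookup_like_get_attribute(properties, attributes, name):
    """Toy model of the documented lookup order (hypothetical helper)."""
    if name in properties:       # 1. a property with that name wins
        return properties[name]
    if name in attributes:       # 2. otherwise fall back to the attribute
        return attributes[name]
    return None                  # 3. neither exists


# Assumed element: <div id="box" data-x="1"><span>Hi</span></div>
dom_properties = {"id": "box", "innerHTML": "<span>Hi</span>"}
html_attributes = {"id": "box", "data-x": "1"}

print(lookup_like_get_attribute(dom_properties, html_attributes, "innerHTML"))
print(lookup_like_get_attribute(dom_properties, html_attributes, "data-x"))
print(lookup_like_get_attribute(dom_properties, html_attributes, "missing"))
```

Note that innerHTML only appears in the property dictionary: it is never written in the markup as an attribute, which is why the "property first" rule matters here.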
    

text

text gets the text of the element.

  • Definition:

    @property
    def text(self):
        """The text of the element."""
        return self._execute(Command.GET_ELEMENT_TEXT)['value']
    
  • Example:

    # Extract the text of an element.
    my_text = target_element.text
    

Does it still sound similar? Read below...


Attributes and properties

When the browser loads the page, it parses the HTML and generates DOM objects from it. For element nodes, most standard HTML attributes automatically become properties of DOM objects.

For instance, if the tag is:

<body id="page">

then the DOM object has body.id="page".

Note: The attribute-property mapping is not one-to-one!


HTML attributes

In HTML, tags may have attributes. When the browser parses the HTML to create DOM objects for tags, it recognizes standard attributes and creates DOM properties from them.

So when an element has id or another standard attribute, the corresponding property gets created. But that doesn’t happen if the attribute is non-standard.

Note: A standard attribute for one element can be unknown for another one. For instance, type is standard attribute for <input> tag, but not for <body> tag. Standard attributes are described in the specification for the corresponding element class.

So, if an attribute is non-standard, there won’t be a DOM-property for it. In that case all attributes are accessible by using the following methods:

  • elem.hasAttribute(name): checks for existence.
  • elem.getAttribute(name): gets the value.
  • elem.setAttribute(name, value): sets the value.
  • elem.removeAttribute(name): removes the attribute.

An example of reading a non-standard attribute:

<body something="non-standard">
  <script>
    alert(document.body.getAttribute('something')); // non-standard
  </script>
</body>

Property-attribute synchronization

When a standard attribute changes, the corresponding property is auto-updated, and (with some exceptions) vice versa. For instance, input.value synchronizes only from attribute to property, not back. This feature actually comes in handy, because the user may modify value, and afterwards, if we want to recover the "original" value from the HTML, it's still in the attribute.


As per Attributes and Properties in Python when we reference an attribute of an object with something like someObject.someAttr, Python uses several special methods to get the someAttr attribute of the object. In the simplest case, attributes are simply instance variables.

Python Attributes

In a broader perspective:

  • An attribute is a name that appears after an object name. This is the syntactic construct. For example, someObj.name.
  • An instance variable is an item in the internal __dict__ of an object.
  • The default semantics of an attribute reference is to provide access to the instance variable. When we mention someObj.name, the default behavior is effectively someObj.__dict__['name']

Python Properties

In Python we can bind getter, setter (and deleter) functions with an attribute name, using the built-in property() function or @property decorator. When we do this, each reference to an attribute has the syntax of direct access to an instance variable, but it invokes the given method function.
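As a small illustration of the two Python notions above, here is a sketch with a made-up class (FakeElement is hypothetical and has nothing to do with Selenium's WebElement): raw is a plain instance variable stored in __dict__, while text is a property that runs a method on every access.

```python
class FakeElement:
    """Hypothetical class, only to contrast instance variables and properties."""

    def __init__(self, raw):
        self.raw = raw  # plain instance variable, lives in self.__dict__

    @property
    def text(self):
        # Looks like an attribute access, but this code runs on every read.
        return " ".join(self.raw.split())


element = FakeElement("  Example   Text ")
print(element.raw)
print(element.text)
print(sorted(element.__dict__))
```

Only raw shows up in __dict__; text is resolved through the class, which is the same mechanism behind element.text in Selenium's Python bindings.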

Peter Mortensen
undetected Selenium
5

.text will retrieve an empty string if the text is not present in the viewport, so you can scroll the object into the viewport and try .text again. It should then retrieve the value.

In contrast, innerHTML can get the value even if the element is outside the viewport.
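A minimal sketch of that idea, assuming driver is a Selenium WebDriver and element a WebElement. The helper name text_after_scrolling is made up, and whether scrolling actually helps depends on the browser driver, so it falls back to the textContent DOM property at the end.

```python
def text_after_scrolling(driver, element):
    """Hypothetical helper: try .text, scroll into view, then fall back."""
    text = element.text
    if text:
        return text

    # Scroll the element into the viewport and read the text again.
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    text = element.text
    if text:
        return text

    # Last resort: textContent is returned even for non-rendered text.
    return (element.get_attribute("textContent") or "").strip()
```

The fallback returns text regardless of visibility, so it may include content the user never sees on the page.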

Peter Mortensen
Jyothishwar Deo
4

For instance, <div><span>Example Text</span></div>.

.get_attribute("innerHTML") gives you the actual HTML inside the current element. So theDivElement.get_attribute("innerHTML") returns "<span>Example Text</span>".

.text gives you only text, not including the HTML node. So theDivElement.text returns "Example Text".

Please note that the algorithm for .text depends on the webdriver of each browser. In some cases, such as when the element is hidden, you might get different text when you use a different webdriver.

I usually get text from .get_attribute("innerText") instead of .text, so I can handle all the cases.
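If you already have the innerHTML string and only need the text, one option is to strip the tags yourself. Below is a rough sketch using Python's standard html.parser module; strip_tags and _TextCollector are made-up names, and unlike .text this ignores visibility entirely (it would also keep the contents of script or style tags).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(inner_html):
    """Hypothetical helper: return the text content of an innerHTML string."""
    collector = _TextCollector()
    collector.feed(inner_html)
    collector.close()  # flush any buffered text
    return "".join(collector.chunks)


print(strip_tags("<span>Example Text</span>"))  # prints: Example Text
```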

Peter Mortensen
Buaban
2

Chrome (I'm not sure about other browsers) ignores the extra spaces within the HTML code and displays them as a single space.

<div><span>Example  Text</span></div> <!-- Notice the two spaces -->

.get_attribute('innerHTML') will return the double-spaced text (which is what you would see when you inspect the element), while .text will return the string with only one space. Here element is the <span>:

>>> element.get_attribute('innerHTML')
'Example  Text'
>>> element.text
'Example Text'

This difference is not trivial as the following will result in a NoSuchElementException.

>>> arg = '//div[contains(text(),"Example Text")]'
>>> driver.find_element_by_xpath(arg)

Similarly, .get_attribute('innerHTML') for the following returns Example&nbsp;Text, while .text returns Example Text.

<div><span>Example&nbsp;Text</span></div>
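If you need to compare an innerHTML string with what .text shows, you can normalize it first. The following is only an approximation of what the browser does, using the standard html and re modules; normalize_like_text is a made-up helper name, and it assumes the string contains no tags.

```python
import html
import re


def normalize_like_text(raw):
    """Hypothetical helper: approximate .text for a tag-free innerHTML string."""
    unescaped = html.unescape(raw)  # "&nbsp;" becomes "\u00a0"
    # Collapse runs of ordinary whitespace, as the browser does when rendering.
    collapsed = re.sub(r"[ \t\r\n\f]+", " ", unescaped).strip()
    # Treat non-breaking spaces as regular spaces for comparison purposes.
    return collapsed.replace("\u00a0", " ")


print(normalize_like_text("Example  Text"))       # prints: Example Text
print(normalize_like_text("Example&nbsp;Text"))   # prints: Example Text
```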
Peter Mortensen
Ji Wei
0

I have just selected the CSS selector and used the below code:

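from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("http://www.costco.com/Weatherproof%C2%AE-Men's-Ultra-Tech-Jacket.product.100106552.html")
# find_element(By.CSS_SELECTOR, ...) is the current spelling of find_element_by_css_selector(...)
print(driver.find_element(By.CSS_SELECTOR, ".product-h1-container.visible-xl-block>h1").text)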

and it prints:

Weatherproof® Men's Ultra Tech Jacket

The problem is that the h1[itemprop='name'] selector returns two matching nodes on Google Chrome or Firefox, while .product-h1-container.visible-xl-block>h1 returns only one matching node. That's why it prints what is expected.

To prove my point, run the below code:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("http://www.costco.com/Weatherproof%C2%AE-Men's-Ultra-Tech-Jacket.product.100106552.html")
# find_elements (plural) returns every node that matches the selector
x = driver.find_elements(By.CSS_SELECTOR, "h1[itemprop='name']")

for i in x:
    print("This is line ", i.text)

It will print

This is line
This is line  Weatherproof® Men's Ultra Tech Jacket

This is because find_element_by_css_selector selects only the first element that matches the selector, and that element does not contain any text, so nothing is printed. Hope you understand now.
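When a selector can match several nodes like this, one way around it is to take all matches and keep the first one that actually has text. A small sketch, assuming each item exposes a .text string the way Selenium elements do; first_non_empty_text is a made-up helper name.

```python
def first_non_empty_text(elements):
    """Hypothetical helper: text of the first element whose .text is not blank."""
    for element in elements:
        text = element.text.strip()
        if text:
            return text
    return None  # every match was empty (or there were no matches)
```

With the page above, you would pass it the list returned by the plural lookup for h1[itemprop='name'] instead of relying on the first match.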

Peter Mortensen
thebadguy
  • thats awesome ...but dont find a difference between above selection and this `driver.find_element_by_css_selector("h1[itemprop='name']").text` ...both selects same element right?....why in the case `.text` works – Vivek Srinivasan Nov 04 '16 at 06:54
  • your selector is returning 2 matching node in which one does not contain the text & second one contains its while mine is only 1 which contains the text, so it prints it out – thebadguy Nov 04 '16 at 06:55
  • Thanks for clear explanation !!!! Do we really have two elements in the page...like one visible and one invisible...when we do inspect element could not catch it...is it something to do with browser...or again missed any trivial stuff? – Vivek Srinivasan Nov 04 '16 at 07:04
  • `reviewsCount = driver.find_elements_by_css_selector("li[itemprop='review']") reviewTitle = reviewsCount[0].find_elements_by_css_selector(".bv-content-title") reviewTitle[0].get_attribute("innerHTML")` in this case reviewTitle has got only one element .But in this case `.text` didnt work.... – Vivek Srinivasan Nov 04 '16 at 07:14
  • @VivekSrinivasan, can you explain what you to achieve by this sequence – thebadguy Nov 04 '16 at 13:46
  • I am trying to extract review title through this ....for example the code gets first review title – Vivek Srinivasan Nov 04 '16 at 17:40