58

Consider:

<div id="a">This is some
   <div id="b">text</div>
</div>

Getting "This is some" is nontrivial. For instance, this returns "This is some text":

driver.find_element_by_id('a').text

How does one, in a general way, get the text of a specific element without including the text of its children?

Peter Mortensen
josh
  • So, for the record, I ended up doing it in JavaScript. I have jQuery on the pages I'm testing, so I took advantage of the fact that Selenium automatically converts DOM elements returned from JavaScript into WebElements: my_result = driver.execute_script('return [...call to my jquery function..]') – josh Sep 10 '12 at 19:03

5 Answers

30

Here's a general solution:

def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)

The element passed to the function can be something obtained from the find_element...() methods (i.e., it can be a WebElement object).

Or if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:

return driver.execute_script("""
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while(child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
""", element)

I'm actually using this code in a test suite.

Peter Mortensen
Louis
  • Right, what I basically realized is... don't use Selenium's search methods, just use jQuery – josh Sep 26 '13 at 23:59
  • @josh, I would disagree with that... Selenium's methods are meant to mock interactions from a user's POV whereas jQuery is not. Yes, you can use both to grab elements, but in general there should be relatively few situations where you'd need to execute JavaScript. – wlingke Dec 16 '13 at 15:46
  • The first code snippet assumes jQuery is loaded in the page. The second code snippet works whether or not jQuery is loaded. – Louis Apr 21 '16 at 12:40
14

In the HTML which you have shared:

<div id="a">This is some
   <div id="b">text</div>
</div>

The text This is some is within a text node. To depict the text node in a structured way:

<div id="a">
    This is some
   <div id="b">text</div>
</div>

This use case

To extract and print the text This is some from the text node using Selenium's Python client, there are two approaches:

  • Using splitlines(): You can identify the parent element i.e. <div id="a">, extract the innerHTML and then use splitlines() as follows:

  • using xpath:

    print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])
    
  • using css_selector:

    print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])
    
  • Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:

  • using xpath and firstChild:

    parent_element = driver.find_element_by_xpath("//div[@id='a']")
    print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())
    
  • using xpath and childNodes[n]:

    parent_element = driver.find_element_by_xpath("//div[@id='a']")
    # childNodes[0] is the leading text node; childNodes[1] would be <div id="b">.
    print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
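The splitlines() approach depends on how the markup happens to be laid out. The following sketch needs no browser and assumes that innerHTML comes back exactly as written in the question; the single-line `compact_html` string is a hypothetical variant added only to show the difference.

```python
# innerHTML of <div id="a"> as written in the question (assumed to be returned unchanged).
inner_html = 'This is some\n   <div id="b">text</div>\n'
first_line = inner_html.splitlines()[0]

# Hypothetical variant: the same elements, but with the markup on a single line.
compact_html = 'This is some<div id="b">text</div>'
compact_first_line = compact_html.splitlines()[0]

print(repr(first_line))          # 'This is some'
print(repr(compact_first_line))  # 'This is some<div id="b">text</div>'
```

When the child element starts on the same line as the parent's text, the first line still contains the child's markup, so the execute_script() variants are likely the safer choice in that situation.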
    
undetected Selenium
4

Use:

def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text
Peter Mortensen
josh
  • This runs disgustingly slowly, though... there has to be a better way?? – josh Sep 07 '12 at 21:39
  • You should always try to get the most specific child element you can. In this case, if you've got a lot of child elements, it'll run slowly. Why don't you check whether the element actually has text before returning, i.e., make the XPath `*[string-length(text()) > 1]`, or make the for loop check that `child.text` is not null and not empty? Also, what about a CSS selector? XPath queries are very slow anyway, so maybe a CSS selector will be faster. – Arran Sep 07 '12 at 23:53
3

You don't have to do a replace. You can get the length of the children text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
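A minimal sketch of that idea, working on plain strings so it can be tried without a browser. The function name `own_text_by_slicing` is illustrative, and the sketch assumes that all of the children's text sits in one contiguous block at the very end of the parent's text, as it does in the question's example. It does not hold when child text appears in the middle or when several children are separated by line breaks.

```python
def own_text_by_slicing(full_text, child_texts):
    """Drop trailing child text from full_text by length instead of replace().

    Assumption: the children's text forms one contiguous block at the very
    end of full_text. If it does not, the result is wrong.
    """
    child_len = sum(len(text) for text in child_texts)
    own = full_text[:len(full_text) - child_len]
    # Strip the line break that separates the parent's text from the child's.
    return own.strip()


print(own_text_by_slicing("This is some\ntext", ["text"]))  # This is some
```

With Selenium, the arguments would presumably be the parent's `.text` and a list of each child's `.text`, collected the same way as in the get_true_text answer.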

Peter Mortensen
kreativitea
2

Unfortunately, Selenium was only built to work with Elements, not Text nodes.

If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.

One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like Beautiful Soup that can handle text nodes more elegantly.

import bs4
from bs4 import BeautifulSoup

inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')

outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')

From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.

Here's a simple one-liner that may be sufficient:

inner_soup.find(text=True)

If that doesn't work, then you can loop through the element's child nodes with the .contents attribute and check their object type.

Beautiful Soup has four kinds of objects, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.

contents = inner_soup.contents

for bs4_object in contents:
    # isinstance() also matches subclasses, e.g. Comment is a NavigableString.
    if isinstance(bs4_object, bs4.Tag):
        print("This object is an Element.")
    elif isinstance(bs4_object, bs4.NavigableString):
        print("This object is a Text node.")
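If adding a dependency is not an option, the same idea (parse the innerHTML and keep only the top-level text) can be sketched with the standard library's html.parser module instead of Beautiful Soup. This is a swapped-in technique, not part of the approach above; the names `OwnTextParser` and `own_text_from_inner_html` are illustrative, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser


class OwnTextParser(HTMLParser):
    """Collects text that is not nested inside any tag of the parsed fragment."""

    # Void elements have no closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Depth 0 means the text is a direct child of the fragment's root.
        if self.depth == 0:
            self.chunks.append(data)


def own_text_from_inner_html(inner_html):
    """Return the top-level text of an innerHTML string, whitespace-collapsed."""
    parser = OwnTextParser()
    parser.feed(inner_html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


print(own_text_from_inner_html('This is some\n   <div id="b">text</div>\n'))  # This is some
```

The input would be the same inner_html string obtained from get_attribute("innerHTML") earlier in this answer.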

Note that Beautiful Soup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this question.

Peter Mortensen
Pikamander2