How can I get the text from this HTML snippet using lxml?

Question

Can anyone explain why this snippet fails on the assert?

from lxml import etree

s = '<div><h2><img />XYZZY</h2></div>'

root = etree.fromstring(s)

elements = root.xpath(".//*[contains(text(),'XYZZY')]")  # Finds 1 element, as expected

for el in elements:
    assert el.text is not None

And then... how can I get access to "XYZZY" and change it to "ZYX"?

score 2 · Accepted Answer · edited Oct 22 '21 at 16:42

Can anyone explain why this snippet fails on the assert?

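Because lxml does not store that text on the <h2> element itself. An element's .text holds only the text that comes before its first child; here XYZZY comes after the <img /> child, so the .text of <h2> is None and lxml keeps the string on the <img> child instead (in its .tail, see the update below). The XPath still matches <h2> because, in the XPath data model, XYZZY is a text-node child of <h2>. You can use itertext() to collect all the text inside the element: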

from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
elements = root.xpath(".//*[contains(text(),'XYZZY')]")
for el in elements:
    el_text = ''.join(el.itertext())
    assert 'XYZZY' in el_text  # ''.join() never returns None, so check the content instead
    print(el_text)

UPDATE: After looking at this some more, it turns out each Element has 3 relevant properties: .tag, .text and .tail.

For the .tail property, there is a small part in the tutorial that explains it:

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.

How .tail is being populated is again explained here:

lxml appends trailing text, which is not wrapped inside its own tag, as the .tail attribute of the tag just prior.
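
As a quick check of the quoted behaviour, here is a minimal sketch that parses the tutorial's own snippet and prints where each piece of text ends up. The variable names are only for illustration.

```python
from lxml import etree

# The mixed-content snippet from the lxml tutorial quoted above.
doc = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = doc.find("body")
br = body.find("br")

print(repr(body.text))  # text before the first child of <body>
print(repr(br.text))    # <br/> has no content of its own
print(repr(br.tail))    # text that directly follows <br/>
```

"Hello" belongs to <body> as .text, while "World" is attached to <br/> as .tail, which is the same situation as XYZZY and <img /> in the question.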

We can confirm this by walking through every Element in the tree and printing its .text and .tail to see where the text XYZZY is located:

from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    print("%s: %s : [text=%s : tail=%s]" % (action, elem.tag, elem.text, elem.tail))

Output:

start: div : [text=None : tail=None]
start: h2 : [text=None : tail=None]
start: img : [text=None : tail=XYZZY]
end: img : [text=None : tail=XYZZY]
end: h2 : [text=None : tail=None]
end: div : [text=None : tail=None]

So it is located in the .tail property of the <img> Element.
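
If you would rather let XPath find the location, one option is to select the text nodes themselves. This is a sketch that relies on lxml's "smart" string results, which offer getparent(), is_text and is_tail; the names hits and owner are just placeholders.

```python
from lxml import etree

root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')

# Select the text nodes themselves rather than the elements that contain them.
hits = root.xpath(".//text()[contains(., 'XYZZY')]")

for hit in hits:
    owner = hit.getparent()  # element whose .text or .tail holds this string
    print(owner.tag, "is_text=%s" % hit.is_text, "is_tail=%s" % hit.is_tail)
```

For this snippet the single hit should report img as the owner with is_tail set, which agrees with the iterwalk output above.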

About your 2nd question:

And then... how can I get access to "XYZZY" and change it to "ZYX"?

One solution is to just walk the Element tree, check whether each element has the string in its text or tail, and then replace it:

#!/usr/bin/python3
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

search_string = "XYZZY"
replace_string = "ZYX"

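# Each element is visited twice (on "start" and on "end"); its .text or .tail
# is replaced only when, stripped of whitespace, it equals the search string.
context = etree.iterwalk(root, events=("start","end"))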
for action, elem in context:
    if elem.text and elem.text.strip() == search_string:
        elem.text = replace_string
    elif elem.tail and elem.tail.strip() == search_string:
        elem.tail = replace_string

print(etree.tostring(root).decode("utf-8"))

Output:

<div><h2><img/>ZYX</h2></div>
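
The loop above only replaces a .text or .tail that equals the search string exactly. If the string can appear inside longer text, a variant along these lines might be more convenient. It is only a sketch: replace_in_tree is a hypothetical helper, not part of lxml, and it assumes the replacement never has to span an element boundary.

```python
from lxml import etree


def replace_in_tree(root, old, new):
    """Replace `old` with `new` in every .text and .tail (hypothetical helper)."""
    replaced = 0
    for elem in root.iter():
        # Text before the first child lives in .text ...
        if elem.text and old in elem.text:
            elem.text = elem.text.replace(old, new)
            replaced += 1
        # ... and text that follows the element lives in its .tail.
        if elem.tail and old in elem.tail:
            elem.tail = elem.tail.replace(old, new)
            replaced += 1
    return replaced


root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')
n = replace_in_tree(root, "XYZZY", "ZYX")
result = etree.tostring(root).decode("utf-8")
print(n, result)
```

The return value counts how many .text or .tail strings were changed, so a result of 0 tells you the search string was not found anywhere in the tree.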
