Can anyone explain why this snippet fails on the assert?
Because the text is not stored on the <h2> element itself: lxml stores it on one of the children of the h2 element, so the h2's own .text is None. You can use itertext() to get what you're looking for.
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
elements = root.xpath(".//*[contains(text(),'XYZZY')]")
for el in elements:
    el_text = ''.join(el.itertext())
    assert el_text is not None
    print(el_text)
UPDATE: After looking at this some more, it turns out each Element has 3 relevant properties: .tag, .text and .tail.
For the .tail property, there is a small part in the tutorial that explains it:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.
How .tail is populated is also explained here:
LXML appends trailing text, which is not wrapped inside its own tag, as the .tail attribute of the tag just prior.
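To make those two quotes concrete, here is a small check against the tutorial's own example document. This is just a sketch; the variable names body and br are mine.

```python
from lxml import etree

root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

print(repr(body.text))  # 'Hello': text before the first child lives in .text
print(repr(br.text))    # None: <br/> has no content of its own
print(repr(br.tail))    # 'World': text right after <br/> lives in its .tail
```

So text that comes before the first child belongs to the parent's .text, and text that comes after a child belongs to that child's .tail.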
So we can actually write the following code, to walk through each Element in the Element tree and find where the text XYZZY is located:
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    print("%s: %s : [text=%s : tail=%s]" % (action, elem.tag, elem.text, elem.tail))
Output:
start: div : [text=None : tail=None]
start: h2 : [text=None : tail=None]
start: img : [text=None : tail=XYZZY]
end: img : [text=None : tail=XYZZY]
end: h2 : [text=None : tail=None]
end: div : [text=None : tail=None]
So it is located in the .tail property of the <img> Element.
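As an aside, you can probably get the same answer without walking the tree by hand. A minimal sketch, assuming lxml's default "smart string" XPath results (text() nodes that carry getparent(), is_text and is_tail); the names hits, owner and kind are mine:

```python
from lxml import etree

root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')

# Select the text nodes themselves instead of the elements that contain them.
hits = root.xpath(".//text()[contains(., 'XYZZY')]")
for hit in hits:
    owner = hit.getparent()  # for tail text, this is the element the tail belongs to
    kind = "tail" if hit.is_tail else "text"
    print("%r is the .%s of <%s>" % (str(hit), kind, owner.tag))
```

For this document it should report the string as the .tail of <img>, which matches the iterwalk output.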
About your 2nd question:
And then... how can I get access to "XYZZY" and change it to "ZYX"?
One solution is to just walk the Element tree, check whether each element has the string in its text or tail, and then replace it:
#!/usr/bin/python3
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
search_string = "XYZZY"
replace_string = "ZYX"
context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    if elem.text and elem.text.strip() == search_string:
        elem.text = replace_string
    elif elem.tail and elem.tail.strip() == search_string:
        elem.tail = replace_string
print(etree.tostring(root).decode("utf-8"))
Output:
<div><h2><img/>ZYX</h2></div>
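Note that the loop above only matches when the whole (stripped) text or tail equals the search string. If the string can appear inside longer text, a variation like the one below may be closer to what you want. This is only a sketch: replace_in_tree is a made-up helper name, and the sample document is extended with an extra <p> for illustration.

```python
from lxml import etree

def replace_in_tree(root, old, new):
    """Hypothetical helper: substring-replace old with new in each .text and .tail."""
    # root.iter() visits each node once, so no start/end events are needed.
    for elem in root.iter():
        if elem.text and old in elem.text:
            elem.text = elem.text.replace(old, new)
        # Plain "if", not "elif": one element can match in both text and tail.
        if elem.tail and old in elem.tail:
            elem.tail = elem.tail.replace(old, new)

root = etree.fromstring('<div><h2><img />say XYZZY twice</h2><p>XYZZY</p></div>')
replace_in_tree(root, "XYZZY", "ZYX")
print(etree.tostring(root).decode("utf-8"))
# <div><h2><img/>say ZYX twice</h2><p>ZYX</p></div>
```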