Get the entire parent tag's text in ElementTree

Question

While using Python's xml.etree.ElementTree package (imported as ET), I would like to get the entire text within an XML tag that contains some child nodes. Consider the following XML:

<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>

Assuming that the above XML is in node, node.text gives me only This is the start of parent tag.... However, I want to capture all of the text inside the p tag (including the text of its child tags), which would result in: This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2.

Is there a workaround for this? I looked through the documentation but couldn't find anything that works.

score 2 · Answer 1 · answered Mar 06 '20 at 20:28

You can do something similar with ElementTree:

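import xml.etree.ElementTree as ET
data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>"""
root = ET.fromstring(data)  # fromstring() returns the root element, here <p>
# itertext() yields each text and tail fragment in document order
print(' '.join(root.itertext()).strip())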

Output:

This is the start of parent tag...
         child 1 . blah1 blah1 blah1  child2  blah2 blah2 blah2
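
The output above still carries the newline and indentation from the source, and joining with ' ' puts a stray space before the period. Here is a minimal sketch of one way to tidy that, assuming it is acceptable to collapse every run of whitespace to a single space. The helper name full_text is hypothetical, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>"""


def full_text(element):
    """Return all text inside element, with whitespace runs collapsed.

    full_text is a hypothetical helper name, not an ElementTree API.
    """
    # itertext() yields every text and tail fragment in document order;
    # joining with "" keeps the fragments exactly as they were written.
    raw = "".join(element.itertext())
    # split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace; " ".join() puts single spaces back.
    return " ".join(raw.split())


node = ET.fromstring(data)
print(full_text(node))
# This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2
```

If whitespace inside the text is significant for you, skip the split/join step and use the raw joined string instead.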

Jack Fleeting

Right, I did not even see this was about `xml.etree` :-). +1! – Mathias Müller Mar 06 '20 at 20:44

score 1 · Accepted Answer · answered Mar 06 '20 at 20:16

This is indeed a very awkward peculiarity of ElementTree. The gist is: if an element contains both text and child elements, only the text before the first child is stored as the element's text. Any text that follows a child element is stored as that child's tail instead.

In order to collect all text that is an immediate child or descendant of an element, you need the text of this element, plus the text and tail of all its descendant elements.

>>> from lxml import etree

>>> s = '<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>'

>>> root = etree.fromstring(s)
>>> child1, child2 = list(root)  # getchildren() is deprecated; iterating an element yields its children

>>> root.text
'This is the start of parent tag...'

>>> child1.text, child1.tail
('child 1', '. blah1 blah1 blah1 ')

>>> child2.text, child2.tail
('child2', ' blah2 blah2 blah2 ')

As for a complete solution, I discovered that this answer does something very similar, which you can easily adapt to your use case (by not printing the names of elements).
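
To make the text/tail walk concrete, here is a minimal recursive sketch. It is not the code from the linked answer; it assumes the standard-library xml.etree.ElementTree rather than lxml, and collect_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Concatenate the text of element and of all its descendants.

    collect_text is a hypothetical helper written for illustration.
    """
    parts = [element.text or ""]           # text before the first child
    for child in element:
        parts.append(collect_text(child))  # the child's own content
        parts.append(child.tail or "")     # text that follows the child
    return "".join(parts)


s = ('<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>'
     '. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>')
root = ET.fromstring(s)
result = collect_text(root)
print(repr(result))
# 'This is the start of parent tag...child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2 '
```

Note that the tail of the top-level element itself is deliberately left out, since that text sits outside the element.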

Edit: actually, the simplest solution by far, in my opinion, is to use itertext:

>>> ''.join(root.itertext())
'This is the start of parent tag...child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2 '
