I have some XML and would like to extract a subset of the inner XML. I have looked through the docs and examples for ElementTree but have not found a solution.

Given the example below, is there an easy way to achieve this? Original XML:

    <a>
        <b1>not interested</b1>
        <b2 key="not interested at all">
            <c key="i want this">
                <d1> the good stuff</d1>
                <d2> more good stuff </d2>
                <d3>
                    <e1 key="good">still good stuff</e1>
                </d3>
            </c>
        </b2>
    </a>

I want to pull out some of the inner XML, so the result would look like:

    <c key="i want this">
        <d1> the good stuff</d1>
        <d2> more good stuff </d2>
        <d3>
            <e1 key="good">still good stuff</e1>
        </d3>
    </c>

2 Answers

It can be easily done with lxml:

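import lxml.html as lh
data = """[your XML above]"""

# Parse the markup into an element tree.
doc = lh.fromstring(data)
# xpath() returns a list of matching elements.
target = doc.xpath('.//c')
# tostring() returns bytes, so decode before printing.
print(lh.tostring(target[0]).decode())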

The output is your expected output.

Jack Fleeting

The following solution uses the xml.etree.ElementTree module and the example XML from your question.

The numbered notes below correspond to the numbered comments in the code sample.

  1. Create root element from the source XML text.

    The xml.etree.ElementTree.fromstring() function parses the provided XML string and returns an Element instance.

  2. Use XPath query to locate the new root Element.

    The findall() method returns a list of the Element objects below the source Element that match the query.

    Since you are trying to establish a new root for your new XML document, this query must be designed to match one and only one element from the source document, hence the extraction of new_root via [0]. (Insert appropriate error handling here!)

The ElementTree module has limited XPath support, but here is a breakdown of the query string:

.//c: Search for all <c> elements at any depth below the current element

[@key='i want this']: Filter found <c> elements and return only those with a key attribute matching 'i want this'

  3. Encode new root Element to a Unicode string.

    The xml.etree.ElementTree.tostring() function renders the provided Element and its children to XML text. encoding="unicode" is specified because the default encoding returns a byte string.
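
The XPath filtering and encoding notes above can be checked with a small sketch. The sample document here is a made-up, compact stand-in (two <c> elements, no indentation) rather than the XML from the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two <c> elements, for illustration only.
doc = ET.fromstring(
    '<a><b><c key="i want this"><d>x</d></c><c key="other"/></b></a>'
)

# ".//c" matches every <c> descendant; the predicate narrows the result.
all_c = doc.findall(".//c")
wanted = doc.findall(".//c[@key='i want this']")
print(len(all_c), len(wanted))  # 2 1

# The default encoding yields bytes; encoding="unicode" yields str.
as_bytes = ET.tostring(wanted[0])
as_text = ET.tostring(wanted[0], encoding="unicode")
print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str
print(as_text)  # <c key="i want this"><d>x</d></c>
```

Without the [@key=...] predicate, the [0] index would still return the first <c>, which happens to be the right one here but is not guaranteed in general.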

Code sample:

import xml.etree.ElementTree as ET

if __name__ == "__main__":
    # 0. Assign test XML text string.
    my_xml = '''<a>
        <b1>not interested</b1>
        <b2 key="not interested at all">
            <c key="i want this">
                <d1> the good stuff</d1>
                <d2> more good stuff </d2>
                <d3>
                    <e1 key="good">still good stuff</e1>
                </d3>
            </c>
        </b2>
    </a>'''
    # 1. Create root Element from the source XML text.
    root = ET.fromstring(my_xml)
    # 2. Use XPath query to locate the new root Element.
    new_root = root.findall(".//c[@key='i want this']")[0]
    # 3. Encode new root Element to a Unicode string.
    my_new_xml = ET.tostring(new_root, encoding="unicode")
    print(my_new_xml)
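
For the error handling mentioned in note 2, one possible approach is to check the number of matches before indexing. The sketch below is only an assumption about what "appropriate" might mean here: extract_subtree is a hypothetical helper (not part of ElementTree), and raising ValueError is one choice among several.

```python
import xml.etree.ElementTree as ET


def extract_subtree(xml_text, path):
    """Return the single element matching `path`, serialized as a str.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    root = ET.fromstring(xml_text)
    matches = root.findall(path)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {path!r}, found {len(matches)}"
        )
    element = matches[0]
    # tostring() also emits the element's tail (the whitespace after its
    # closing tag), so clear it to avoid trailing indentation in the result.
    element.tail = None
    return ET.tostring(element, encoding="unicode")


# Compact made-up sample; note the newline and spaces after </c>.
sample = '<a><b2><c key="i want this"><d1>good</d1></c>\n  </b2></a>'
print(extract_subtree(sample, ".//c[@key='i want this']"))
# <c key="i want this"><d1>good</d1></c>
```

A query that matches nothing, such as extract_subtree(sample, ".//missing"), raises ValueError instead of the IndexError that a bare [0] would produce.
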
Phil Brubaker