In Python, how do I remove a node but keep its children using the xml.etree API?

Yes, I know there's an answer using lxml, but since xml.etree is part of Python's standard library, I figure it deserves an answer too.

Original xml file:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Let's say I want to remove the country nodes but keep their children and assign them to the parent of country.

Ideally, I want a solution that does things "in place" instead of creating a new tree.

My (non-working) solution:

# Get all parents of `country`
for country_parent in root.findall(".//country/.."):
    print(country_parent.tag)
    # Some countries could have same parent so get all
    # `country` nodes of current parent
    for country in country_parent.findall("./country"):
        print('\t', country.tag)
        # For each child of `country`, assign it to parent
        # and then delete it from `parent`
        for country_child in country:
            print('\t\t', country_child.tag)
            country_parent.append(country_child)
            country.remove(country_child)
        country_parent.remove(country)
tree.write("test_mod.xml")

Output of my print statements:

data
     country
         rank
         gdppc
         neighbor
     country
         rank
         gdppc
     country
         rank
         gdppc
         neighbor

Right away we can see there's a problem: each country is missing its year child and some of its neighbor children.

The resulting .xml output:

<data>
    <rank>1</rank>
        <gdppc>141100</gdppc>
        <neighbor direction="W" name="Switzerland" />
    <rank>4</rank>
        <gdppc>59900</gdppc>
        <rank>68</rank>
        <gdppc>13600</gdppc>
        <neighbor direction="E" name="Colombia" />
    </data>

This is obviously wrong.

QUESTION: Why does this happen?

I imagine the appending/removing breaks something with the list, i.e. I've "invalidated" the list, much like invalidating an iterator.

Bob
  • `for country_child in country[:]:` as per the answer linked to in your last question, http://stackoverflow.com/questions/37702011/removing-an-element-from-a-parsed-xml-tree-disrupts-iteration – Padraic Cunningham Jun 25 '16 at 10:03

1 Answer

Remove this line from your program:

        country.remove(country_child)

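The iteration of an xml.etree.ElementTree.Element is essentially passed through to the list of sub-elements. Modifying that list during iteration will yield odd results. Each `remove()` shifts the remaining sub-elements one position to the left while the loop still advances to the next index, so every other child is skipped. That is why `year` and some of the `neighbor` elements never show up in your output.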
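For reference, here is a minimal sketch of the corrected loop. The helper name `hoist_children` and the trimmed-down sample document are my own, for illustration only. It iterates over a `list(...)` snapshot of each `country` and never removes children from the node being iterated, so nothing is skipped. Like the code in the question, it appends the hoisted children to the end of the parent.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the sample document from the question (illustrative).
XML = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>"""


def hoist_children(root, tag):
    """Hypothetical helper: remove every `tag` element in place and
    re-attach its children to the element's parent."""
    # findall() returns a plain list, so mutating the tree below is safe.
    for parent in root.findall(f".//{tag}/.."):
        for node in parent.findall(f"./{tag}"):
            # list(node) is a snapshot of the children; `node` itself
            # is never modified while we loop.
            for child in list(node):
                parent.append(child)
            parent.remove(node)


root = ET.fromstring(XML)
hoist_children(root, "country")
print([child.tag for child in root])
# ['rank', 'year', 'neighbor', 'rank', 'year']
```

Dropping the inner `remove()` costs no extra memory worth mentioning: `append()` stores a reference to the existing element, and the whole `country` node is discarded right afterwards.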

Robᵩ
  • Why does that line break things? – Bob Jun 24 '16 at 20:49
  • I'm checking on that now. I believe it is for the reason you surmised, that you are modifying the list while you are iterating over it. – Robᵩ Jun 24 '16 at 20:50
  • but it's a `list` of elements not an iterator, right? – Bob Jun 24 '16 at 20:51
  • The reason why I have that line there is to minimize memory usage. If I remove that line, I'll be temporarily adding sizeof(country's children). – Bob Jun 24 '16 at 20:54
  • I don't think you'll be adding any memory. `country_parent.append()` won't create a new `Element`, it will only link in the existing one. – Robᵩ Jun 24 '16 at 20:56
  • The line `for country_child in country:` creates an iterator, not a list. If you think you need a list, you could change that line to: `for country_child in list(country):`. – Robᵩ Jun 24 '16 at 20:57
  • My understanding of Python iterators is not very good. In `for a in b:`, which one is the iterator assuming this is a top-level for-loop? – Bob Jun 24 '16 at 21:07
  • http://stackoverflow.com/questions/9884132/what-exactly-are-pythons-iterator-iterable-and-iteration-protocols – Robᵩ Jun 24 '16 at 21:17
  • I've tried reading this but it doesn't make sense; can you rephrase? `The iteration of an xml.etree.ElementTree.Element is essentially passed through to the list of sub-elements.` – Bob Jun 24 '16 at 21:20
  • I get that the implementation of an iterator is paramount to the correctness of modifying what it's pointing to. But we're talking about a list; why would deleting list[0] cause the iterator to forget that the next element is list[1]? – Bob Jun 24 '16 at 21:26
  • `append()` does add new instance. See `ElementTree.py:Element:append():... self._children.append(subelement)`, `ElementTree.py:Element:__init__:...self._children = []` so it's just a simple list. – Bob Jun 24 '16 at 21:39
  • Ping. I don't understand `"The iteration of an xml.etree.ElementTree.Element is essentially passed through to the list of sub-elements. Modifying that list during iteration will yield odd results."` Please rephrase. – Bob Jun 28 '16 at 15:35
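To make the point from this thread concrete, here is a small sketch that uses a plain Python list as a stand-in for an element's child list. The string names are placeholders of my own, not real elements. A `for` loop over a list tracks a position, not an element, so removing the current item slides the next one into the slot the loop has already visited.

```python
# Plain list standing in for the children of one <country> (illustrative names).
children = ["rank", "year", "gdppc", "neighbor_E", "neighbor_W"]

visited = []
for child in children:        # the loop advances by index: 0, 1, 2, ...
    visited.append(child)
    children.remove(child)    # later items shift left by one position

print(visited)   # ['rank', 'gdppc', 'neighbor_W']
print(children)  # ['year', 'neighbor_E']

# Iterating over a snapshot keeps the loop independent of the mutation.
children = ["rank", "year", "gdppc", "neighbor_E", "neighbor_W"]
visited_safe = []
for child in list(children):
    visited_safe.append(child)
    children.remove(child)

print(visited_safe)  # ['rank', 'year', 'gdppc', 'neighbor_E', 'neighbor_W']
print(children)      # []
```

The first result has the same pattern as the print output in the question for Liechtenstein: `rank`, `gdppc`, and only the last `neighbor`.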