96

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without storing a temporary variable and printing to it manually, as:

newxml="<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
  newxml+=et.tostring(elt, encoding="unicode")  # return str rather than bytes

newxml+="</groceries>"
ewok

6 Answers

185

Use the remove method of the element's parent:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  # remove() only works on a direct child, so call it on the element's parent
  bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine also works when the elements to remove are not directly under the root node of your XML.
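To illustrate the point about nesting, here is a minimal sketch (the sample document and variable names are made up for this illustration). It shows that calling remove on the root fails for an element that is not a direct child, while calling it on the element's own parent works:

```python
import lxml.etree as et

xml = """
<groceries>
  <fruit state="rotten">apple</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
  </punnet>
</groceries>
"""

tree = et.fromstring(xml)
nested = tree.xpath("//punnet/fruit")[0]

# remove() only accepts direct children, so this raises ValueError
try:
    tree.remove(nested)
    failed = False
except ValueError:
    failed = True

# Calling remove() on the actual parent works at any depth
nested.getparent().remove(nested)
remaining = tree.xpath("//fruit/text()")

print(failed, remaining)
```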

Sheena
Cédric Julien
  • 1
    Can you comment on the differences between this answer and the one provided by Acorn? – ewok Nov 02 '11 at 14:27
  • 1
    It's a shame the Element class doesn't have a 'pop' method. – Michael Mulich Aug 28 '15 at 18:17
  • it's a shame xpath can only be used to select elements. it is like SQL with only the select statements. – Eric Chow Jan 12 '21 at 08:44
  • The `remove` function detaches an element from the tree and therefore removes the XML node (Element, PI or Comment), its content (the descendant items) and the `tail` text. Here, preserving the `tail` text is superfluous because it only contains whitespace and a newline. But in some situations you may need to keep it… – Laurent LAPORTE Mar 17 '21 at 08:54
  • To preserve the `tail` text and to optionally keep the element content, you can consider using the [remove_node](https://stackoverflow.com/a/66670197/1513933) function defined below. – Laurent LAPORTE Mar 17 '21 at 09:28
31

You're looking for the remove function. Call the remove method on the element's parent and pass it the subelement to remove.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
Acorn
  • You've just got all the lxml-related answers for me, don't you? ;-) – ewok Nov 02 '11 at 14:25
  • Can you comment on the differences between this answer and the one provided by Cedric? – ewok Nov 02 '11 at 14:27
  • 3
    Ah, I overlooked the fact that `.remove()` requires the element to be a child of the element you are calling it on. So you need to call it on the parent of the element you want to remove. Answer corrected. – Acorn Nov 02 '11 at 14:34
  • @Acorn: that's it, if the element to remove were not directly under the root node, it would have failed. – Cédric Julien Nov 02 '11 at 14:38
  • understood. Does it need to be a child or any descendant? I ask because, given the fact that the xpath expression is run on `tree`, it is certain that any element that is returned is a descendant of `tree`, and therefore `tree.remove()` would work properly. – ewok Nov 02 '11 at 14:38
  • @ewok: it needs to be a child. Try `tree.remove(bad)` with the updated xml above and you'll see the exception. – Acorn Nov 02 '11 at 14:40
  • 18
    @ewok: give Cédric the accept as he answered **1 second** earlier than me, and more importantly, his answer was correct :) – Acorn Nov 02 '11 at 14:47
  • If you can only remove a child of an element, how do you remove the root element? – davidA Jul 20 '16 at 02:03
15

I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I didn't intend.

Following the answer here, I found that etree.strip_elements is a better solution for me: its with_tail parameter (a bool) controls whether the text following the element is removed as well.

I still don't know whether it can be combined with an XPath filter instead of a tag name; I'm adding this for information.

Here is the doc:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::

   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
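As a rough sketch of the with_tail behavior described above, applied to the div/script case (the input is condensed to one line, and the variable names are only illustrative):

```python
from lxml import etree

source = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the text following <script> is removed too
div_default = etree.fromstring(source)
etree.strip_elements(div_default, "script")
without_tail = etree.tostring(div_default, encoding="unicode")

# with_tail=False: <script> is removed, the text after it stays in place
div_keep = etree.fromstring(source)
etree.strip_elements(div_keep, "script", with_tail=False)
with_tail_kept = etree.tostring(div_keep, encoding="unicode")

print(without_tail)
print(with_tail_kept)
```

Note that strip_elements selects by tag name only, so it removes every matching descendant rather than a subset chosen by attribute value.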
zephor
  • Notice that `strip_elements` (and `strip_tags` too) removes *all* descendant elements whose tag name matches one of the *tag_names*. – Laurent LAPORTE Mar 17 '21 at 09:26
5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

But it removes the element together with its tail, which is a problem if you are processing mixed-content documents like HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

Becomes

<div></div>

Which I suppose is not always what you want :) I have created a helper function that removes just the element and keeps its tail:

def remove_element(el):
    parent = el.getparent()
    # tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way it will keep the tail text:

<div> Hello!</div>
Messa
1

You could also use html from lxml to solve that:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>
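The blank lines in the output appear because drop_tree keeps the tail text of each dropped element. A small sketch of the same behavior on a mixed-content fragment (the fragment is borrowed from another answer in this thread; variable names are illustrative):

```python
from lxml import html

# html.fromstring returns the single top-level <div> for this fragment
div = html.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')

for bad in div.xpath("//fruit[@state='rotten']"):
    bad.drop_tree()  # removes the element and its children, keeps its tail

print(html.tostring(div, encoding="unicode"))
```

drop_tree is only available on elements created by lxml.html, not on plain lxml.etree elements.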
Guven Degirmenci
1

The remove function detaches an element from the tree and therefore removes the XML node (Element, PI or Comment), its content (the descendant items) and the tail text. Here, preserving the tail text is superfluous because it only contains spaces and a newline, which can be considered ignorable whitespace.

To remove an element (and its content) while preserving its tail, you can use the following function:

def remove_node(child, keep_content=False):
    """
    Remove an XML element, preserving its tail text.

    :param child: XML element to remove
    :param keep_content: ``True`` to keep child text and sub-elements.
    """
    parent = child.getparent()
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    if keep_content:
        # insert: child text
        child_text = child.text or u""
        if prev_node is None:
            parent.text = u"{0}{1}".format(parent_text, child_text) or None
        else:
            prev_tail = prev_node.tail or u""
            prev_node.tail = u"{0}{1}".format(prev_tail, child_text) or None
        # insert: child elements
        index = parent.index(child)
        parent[index:index] = child[:]
    # insert: child tail
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    child_tail = child.tail or u""
    if prev_node is None:
        parent.text = u"{0}{1}".format(parent_text, child_tail) or None
    else:
        prev_tail = prev_node.tail or u""
        prev_node.tail = u"{0}{1}".format(prev_tail, child_tail) or None
    # remove: child
    parent.remove(child)

Here is a demo:

from lxml import etree

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1)

etree.dump(tree)
# <root>text  tail</root>

If you want to preserve the content, you can do:

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1, keep_content=True)

etree.dump(tree)
# <root>text before <bad>inner</bad> after tail</root>
Laurent LAPORTE