13

It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

... you could easily remove every <r> element with:

from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())

However, how would you go about replacing each element with text, to get the output:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>

It seems to me that because the ElementTree API stores text in the .text and .tail attributes of each element rather than as nodes in the tree, you have to handle a lot of different cases, depending on whether the element has sibling elements, whether the existing element had a .tail attribute, and so on. Have I missed some easy way of doing this?
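To make the .text/.tail model concrete, here is a minimal sketch of where each piece of text ends up after parsing. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only so the snippet runs without extra packages; lxml exposes the same two attributes).

```python
import xml.etree.ElementTree as ET

m = ET.fromstring('<m>Text before <r/> and after<b/> trailing</m>')
r, b = m.find('r'), m.find('b')

# Text before the first child belongs to the parent's .text.
print(repr(m.text))   # 'Text before '
# Text after a child, up to the next sibling, is that child's .tail.
print(repr(r.tail))   # ' and after'
print(repr(b.tail))   # ' trailing'
# An empty element has no .text of its own.
print(repr(r.text))   # None
```

So the replacement text has to be appended either to the previous sibling's .tail or, when there is no previous sibling, to the parent's .text.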

Mark Longair
  • If `<r>` has children, do you want those removed too? Or merged into `<r>`'s parent? – MattH Mar 24 '11 at 11:49
  • In this case I just want to remove the `<r>` node and all its children, and replace it with a text string. Hopefully that's easier :) – Mark Longair Mar 24 '11 at 12:00

3 Answers

20

I think that unutbu's XSLT solution is probably the correct way to achieve your goal.

However, here's a somewhat hacky way to achieve it, by modifying the tails of <r/> tags and then using etree.strip_elements.

from lxml import etree

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

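f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Prepend the replacement text to the tail, which may be None.
    r.tail = 'DELETED' + (r.tail or '')

# Remove the <r> elements but keep their (modified) tails.
etree.strip_elements(f, 'r', with_tail=False)

print(etree.tostring(f, pretty_print=True).decode())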

Gives you:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
MattH
  • Thanks, that's a nice solution - I didn't know about `strip_elements` or the `with_tail` option – Mark Longair Mar 26 '11 at 09:20
  • 4
    Wanted to stick with lxml for HTML processing, but will probably switch to BeautifulSoup: it's far more intuitive for basic HTML modification, and it can use lxml as a parser... `soup = BeautifulSoup(text, "lxml")`, then `for r in soup.find_all('r'): r.replace_with('DELETED')` – benzkji Aug 22 '18 at 07:37
  • Thanks @benzkji for the tip! It is super weird that text is sometimes treated as the tail of other nodes in the ElementTree API and not just as a normal text node as intended by XML. – vlz Oct 01 '19 at 10:03
  • 1
    @vlz XML does not intend anything, and the DOM which you're thinking of is but one possible object model. Not being the DOM is ElementTree's entire point, if you want the DOM there are packages which implement it. – Masklinn Apr 14 '23 at 13:49
  • @Masklinn Thanks for clearing that up! I guess I was so used to DOM representations of XML from other languages/libraries that I thought it was the intended way to represent XML. (I still think it would be more convenient to have text as nodes in a tree, similar to elements, but it's good to know that this is not prescribed by XML itself.) – vlz Apr 15 '23 at 15:16
8

Using strip_elements has the disadvantage that you cannot keep some of the <r> elements while replacing others. It also requires the existence of an ElementTree instance (which may not be the case). Finally, you cannot use it to replace XML comments or processing instructions. The following should do the job:

for r in f.xpath('//r'):
    text = 'DELETED' + (r.tail or '')  # r.tail is None when no text follows <r>
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            previous.tail = (previous.tail or '') + text
        else:
            parent.text = (parent.text or '') + text
        parent.remove(r)
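The same idea can be sketched as a reusable helper that also covers the "keep some elements, replace others" case. This is only an illustration under stated assumptions: the names `replace_with_text` and `should_replace` are hypothetical, and it uses the standard library's xml.etree.ElementTree instead of lxml, so it tracks the previous sibling itself rather than calling getparent() or getprevious().

```python
import xml.etree.ElementTree as ET

def replace_with_text(root, tag, text, should_replace=lambda el: True):
    """Replace matching <tag> descendants of root (with their subtrees) by text, in place."""
    # Snapshot the elements first, because the tree is modified while looping.
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag and should_replace(child):
                # The tail would be lost with the element, so carry it over.
                new_text = text + (child.tail or '')
                if previous is None:
                    parent.text = (parent.text or '') + new_text
                else:
                    previous.tail = (previous.tail or '') + new_text
                parent.remove(child)
            else:
                previous = child
    return root

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

root = replace_with_text(ET.fromstring(data), 'r', 'DELETED')
print(ET.tostring(root, encoding='unicode'))
```

Passing a predicate such as `should_replace=lambda el: el.get('keep') is None` would leave some `<r>` elements untouched (the `keep` attribute is made up for this example).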
bernulf
  • 2
    I think `text = 'DELETED' + r.tail` should be `text = 'DELETED' + r.tail if r.tail else 'DELETED'`. – mzjn May 10 '12 at 13:10
4

Using ET.XSLT:

import io
import lxml.etree as ET

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f=ET.fromstring(data)
xslt='''\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">    

    <!-- Replace r nodes with DELETED
         http://www.w3schools.com/xsl/el_template.asp -->
    <xsl:template match="r">DELETED</xsl:template>

    <!-- How to copy XML without changes
         http://mrhaki.blogspot.com/2008/07/copy-xml-as-is-with-xslt.html -->    
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy-of select="."/>
    </xsl:template>
    </xsl:stylesheet>
'''

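# BytesIO needs bytes, so encode the stylesheet string first.
xslt_doc = ET.parse(io.BytesIO(xslt.encode('utf-8')))
transform = ET.XSLT(xslt_doc)
f = transform(f)

print(ET.tostring(f, encoding='unicode'))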

yields

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
unutbu
  • 3
    +1 That's a nice, but really non-obvious answer :) This question occurred to me because of my insufficient [answer to another question](http://stackoverflow.com/questions/5406326/how-can-i-remove-p-p-with-python-sub/5406515#5406515) and I was hoping there was an easier way than this. Even with a short example like this, XSLT is verbose and difficult to understand compared to the code in my question for just removing the elements. – Mark Longair Mar 24 '11 at 13:29