6

I am currently trying to load an xml file and modify the text inside a pair of xml tags, like this:

   <anode>sometext</anode>

I currently have a helper function called getText that I use to get the text sometext above. Now I need to modify the childnodes I guess, inside the node to modify a node that has the XML snippet shown above, to change sometext to othertext. The common API patch getText function is shown below in the footnote.

So my question is, that's how we get the text, how do I write a companion helper function called setText(node,'newtext'). I'd prefer if it operated on the node level, and found its way down to the childnodes all on its own, and worked robustly.

A previous question has an accepted answer that says "I'm not sure you can modify the DOM in place". Is that really true? Is Minidom so broken that it's effectively Read Only?


By way of footnote, to read text between <anode> and </anode>, I took was surprised no direct simple single minidom function exists, and that this small helper function is suggested in the Python xml tutorials:

import xml.dom.minidom

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

# I've added this bit to make usage of the above clearer
def getTextFromNode(node):
   return getText(node.childNodes)

Elsewhere in StackOverflow, I see this accepted answer from 2008:

   node[0].firstChild.nodeValue

If that's how hard it is to read with minidom, I'm not suprised to see that people say "Just don't do it!" when you ask how to write things that might modify the Node structure of your XML document.

Update The answer below shows it's not as hard as I thought.

Community
  • 1
  • 1
Warren P
  • 65,725
  • 40
  • 181
  • 316

1 Answers1

6

actually minidom is no more difficult to use than other dom parsers, if you dont like it you may want to consider complaining to the w3c

from xml.dom.minidom import parseString

XML = """
<nodeA>
    <nodeB>Text hello</nodeB>
    <nodeC><noText></noText></nodeC>
</nodeA>
"""


def replaceText(node, newText):
    if node.firstChild.nodeType != node.TEXT_NODE:
        raise Exception("node does not contain text")

    node.firstChild.replaceWholeText(newText)

def main():
    doc = parseString(XML)

    node = doc.getElementsByTagName('nodeB')[0]
    replaceText(node, "Hello World")

    print doc.toxml()

    try:
        node = doc.getElementsByTagName('nodeC')[0]
        replaceText(node, "Hello World")
    except:
        print "error"


if __name__ == '__main__':
    main()
Christian Thieme
  • 1,114
  • 6
  • 6
  • I'm just trying to understand why everybody says "use something else". If it's a standard API defined by the W3C then that's a good reason to learn it! – Warren P Nov 27 '12 at 20:09
  • 1
    The W3C standard was first created over a decade ago, and had to work for many different languages and programming paradigms. The DOM is a common denominator, and is dumbed down enormously. In python, a dynamic language, you can do *far* better than the DOM API. – Martijn Pieters Nov 28 '12 at 21:36
  • I guess it bugs me that Python, land of orthogonal sensible apis, is saddled with the least orthogonal, least sensible API I've ever seen for this task. Oh, and Python, language of there's one right way to do it, has a billion XML apis. :-) – Warren P Nov 29 '12 at 01:07
  • But then, I'm complaining about the rich ecosystem of LIBRARIES. There is one right way to do it in the language. Libraries are going to be plural. Fine. Me in 2016 thinks me in 2012 was wrong. – Warren P Sep 28 '16 at 15:54
  • It depends what you're trying to do. If you want very controlled manipulation of the XML itself, dom is the right thing. If you want to get at the stuff inside the document and XML is just an encoding, consider elementtree – Steve Carter Dec 03 '18 at 11:44