XML Parsing with Python and minidom

Question

I'm using Python (minidom) to parse an XML file and print a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationships):

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

Instead, the program iterates multiple times over the nodes and produces the following, printing duplicate nodes. (Looking at the node list at each iteration, it's obvious why it does this but I can't seem to find a way to get the node list I'm looking for.)

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom

dom = xml.dom.minidom.parse("test.xml")
Topic = dom.getElementsByTagName('Topic')
for node in Topic:
    alist = node.getElementsByTagName('Title')
    for a in alist:
        Title = a.firstChild.data
        print(Title)

I could fix the problem by not nesting 'Topic' elements, by changing the lower level topic names to something like 'SubTopic1' and 'SubTopic2'. But, I want to take advantage of built-in XML hierarchical structuring without needing different element names; it seems that I should be able to nest 'Topic' elements and that there should be some way to know which level 'Topic' I'm currently looking at.

I've tried a number of different XPath functions without much success.

If you want the first output, you can just print the text of each element; I am not clear how the structuring affects the desired output. – mmmmmm, Oct 20 '09 at 20:36

bobince · Answer 1 · 2009-10-21T02:11:25.457

10

getElementsByTagName is recursive: you get all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, the calls return the lower-level Titles many times over.
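
The duplication can be reproduced directly. The sketch below is an illustration only: it inlines a copy of the question's document as a string (the name SAMPLE is made up here, and the XML declaration is omitted) and counts how many Title elements each Topic's getElementsByTagName call reports.

```python
import xml.dom.minidom

# Copy of the question's document, inlined so the sketch is self-contained.
SAMPLE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

dom = xml.dom.minidom.parseString(SAMPLE)

# For every Topic (at any depth), count the Title descendants it reports.
counts = []
for topic in dom.getElementsByTagName('Topic'):
    titles = topic.getElementsByTagName('Title')
    own_title = titles[0].firstChild.data  # the Topic's own Title comes first
    counts.append((own_title, len(titles)))

for title, n in counts:
    print("%s: %d Title element(s)" % (title, n))
```

The counts add up to nine, which is the number of lines the question's program prints.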

If you want only the matching direct children, and you don't have XPath available, you can write a simple filter, e.g.:

def getChildrenByTagName(node, tagName):
    # Yield only the direct element children of node whose tag name matches.
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and (tagName == '*' or child.tagName == tagName):
            yield child

# dom is the parsed document from the question
for topic in dom.getElementsByTagName('Topic'):
    title = list(getChildrenByTagName(topic, 'Title'))[0]  # or next(getChildrenByTagName(topic, 'Title'))
    print(title.firstChild.data)
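
Where XPath-style paths are available, the direct-children-only lookup needs no hand-written filter. As a sketch of that alternative: it swaps minidom for the standard library's xml.etree.ElementTree, which this answer does not use, and the SAMPLE string and the outline helper are names made up for illustration. A bare tag name passed to findall or findtext matches direct children only.

```python
import xml.etree.ElementTree as ET

# Copy of the question's document, inlined so the sketch is self-contained.
SAMPLE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

def outline(parent, depth=0):
    """Yield (depth, title) for each Topic nested under parent, in document order."""
    # 'Topic' here selects direct children only, unlike getElementsByTagName.
    for topic in parent.findall('Topic'):
        yield depth, topic.findtext('Title')
        for item in outline(topic, depth + 1):
            yield item

root = ET.fromstring(SAMPLE)
lines = ['    ' * depth + title for depth, title in outline(root)]
print('\n'.join(lines))
```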

edited Oct 21 '09 at 02:11

answered Oct 20 '09 at 22:17

bobince


Thanks for the attempt. It didn't work but it gave me some ideas. The following works (the same general idea; FWIW, the nodeType is ELEMENT_NODE):

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    for child in node.childNodes:
        if child.localName=='Title':
            yield child

Topic=dom.getElementsByTagName('Topic')
for node in Topic:
    alist=getChildrenByTitle(node)
    for a in alist:
        # Title= a.firstChild.data
        Title= a.childNodes[0].nodeValue
        print(Title)

– hWorks Oct 21 '09 at 00:03

score 8 · Answer 2 · edited Jan 26 '21 at 23:47


The following works:

import xml.dom.minidom

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    # Yield only the Title elements that are direct children of node.
    for child in node.childNodes:
        if child.localName == 'Title':
            yield child

for node in dom.getElementsByTagName('Topic'):
    for a in getChildrenByTitle(node):
        print(a.childNodes[0].nodeValue)

edited Jan 26 '21 at 23:47

Alan W. Smith


answered Oct 21 '09 at 00:04

hWorks


I would call the function getTitle (or `get_title`), and have it not return all immediate child Title elements, but just the first one (as there should be just one title per child, anyway). – Martin v. Löwis Oct 21 '09 at 03:52
Maybe this is what I'm not getting. I want the titles of all immediate children. Maybe a better name would be getTitlesOfChildren. – hWorks Oct 21 '09 at 16:37
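
The suggestion in the first comment could look roughly like the sketch below. The name get_title follows that comment; returning None for a Topic that has no Title is an assumption made here, not something the thread specifies, and the inline SAMPLE string is illustrative.

```python
import xml.dom.minidom

# Copy of the question's document, inlined so the sketch is self-contained.
SAMPLE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

def get_title(topic):
    """Return the text of the first Title that is a direct child of topic."""
    for child in topic.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == 'Title':
            return child.firstChild.data
    return None  # assumption: a Topic without a Title yields None

dom = xml.dom.minidom.parseString(SAMPLE)
titles = [get_title(topic) for topic in dom.getElementsByTagName('Topic')]
print(titles)
```

Because get_title looks at direct children only, each Topic contributes exactly one title even though getElementsByTagName('Topic') itself returns nested Topics.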

0x3bfc · Answer 3 · 2014-01-28T16:13:40.700

I think this can help:

import xml.dom.minidom

with open("file.xml", 'r') as f:
    data = f.read()

doc = xml.dom.minidom.parseString(data)
# All Title elements in document order. This prints a flat list with no
# indentation and relies on every Topic having exactly one Title.
titles = doc.getElementsByTagName('Title')
for i, topic in enumerate(doc.getElementsByTagName('Topic')):
    print(titles[i].firstChild.nodeValue)

Output:

My Document
Overview
Basic Features
About This Software
Platforms Supported

score 3 · Answer 4 · answered Oct 21 '09 at 18:45

You could use the following generator to run through the list and get titles with indentation levels:

def f(elem, level=-1):
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for e, l in f(child, level + 1):
                yield e, l

If you test it with your file:

import xml.dom.minidom as minidom
doc = minidom.parse("test.xml")
list(f(doc.documentElement))  # start at the root element; the Document node itself is not an ELEMENT_NODE

you will get a list with the following tuples:

(u'My Document', 1), 
(u'Overview', 1), 
(u'Basic Features', 2), 
(u'About This Software', 2), 
(u'Platforms Supported', 3)

It is only a basic idea to be fine-tuned of course. If you just want spaces at the beginning you can code that directly in the generator, though with the level you have more flexibility. You could also detect the first level automatically (here it's just a poor job of initializing the level to -1...).
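
As a sketch of that fine-tuning, the levels can be turned into indentation after the fact, with the first level detected automatically as the smallest level seen. The generator is the same idea as f above, renamed iter_titles here for illustration; the inline SAMPLE string and the four-space indent are likewise assumptions of this sketch, which also assumes the document contains at least one Title.

```python
import xml.dom.minidom

# Copy of the question's document, inlined so the sketch is self-contained.
SAMPLE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

def iter_titles(elem, level=0):
    """Yield (title, level) pairs for every Title below elem."""
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for pair in iter_titles(child, level + 1):
                yield pair

doc = xml.dom.minidom.parseString(SAMPLE)
pairs = list(iter_titles(doc.documentElement))

# Detect the first level automatically: the shallowest title gets no indent.
base = min(level for _, level in pairs)
lines = ['    ' * (level - base) + title for title, level in pairs]
print('\n'.join(lines))
```

Keeping the level in the tuples and indenting in a separate step preserves the flexibility mentioned above, since the same pairs could feed a different renderer.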

Exactly what I've been trying to do all day before coming upon generators. Many thanks. — hWorks, Oct 21 '09 at 21:42

imesias · Answer 5 · 2013-01-10T10:28:26.843

Recursive function:

import xml.dom.minidom

def traverseTree(node, depth=0):
    # Print the text of every Title element, indented by its depth in the tree.
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if node.tagName == 'Title':
                print('%s %s' % (depth * '    ', child.data))
        if child.nodeType == child.ELEMENT_NODE:
            traverseTree(child, depth + 1)

filename = 'sample.xml'
dom = xml.dom.minidom.parse(filename)
traverseTree(dom.documentElement)

Your XML:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Your desired output:

 $ python parse_sample.py 
      My Document
      Overview
          Basic Features
          About This Software
              Platforms Supported
