Parsing XML to a hash table

Question

I have an XML file in the following format:

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>

<type...>
</type>
</doc>

I would like to parse this document and build a hash table

{X: {"A": [(100,80), (200,90)], "B": [(100,20), (20,90)]}, Y: .....}

How would I do this in Python?

This kind of question has been asked a couple times. The answers may be able to help you out. http://stackoverflow.com/questions/191536/converting-xml-to-json-using-python http://stackoverflow.com/questions/471946/how-to-convert-xml-to-json-in-python — Thomas, Dec 15 '09 at 16:04

score 12 · Answer 1 · answered Dec 15 '09 at 17:18

I disagree with the suggestion in other answers to use minidom -- that's a so-so Python adaptation of a standard originally conceived for other languages, usable but not a great fit. The recommended approach in modern Python is ElementTree.

The same interface is also implemented, faster, in third party module lxml, but unless you need blazing speed the version included with the Python standard library is fine (and faster than minidom anyway) -- the key point is to program to that interface, then you can always switch to a different implementation of the same interface in the future if you want to, with minimal changes to your own code.

For example, after the needed imports &c, the following code is a minimal implementation of your example (it does not verify that the XML is correct, just extracts the data assuming correctness -- adding various kinds of checks is pretty easy of course):

from xml.etree import ElementTree as et  # or, import any other, faster version of ET

def xml2data(xmlfile):
  tree = et.parse(xmlfile)
  data = {}
  for anid in tree.getroot().getchildren():
    currdict = data[anid.get('name')] = {}
    for atype in anid.getchildren():
      currlist = currdict[atype.get('name')] = []
      for c in atype.getchildren():
        currlist.append((c.get('val'), c.get('id')))
  return data

This produces your desired result given your sample input.

`for child in node.getchildren():` is unnecessary; use `for child in node:` instead. — John Machin, Dec 16 '09 at 21:27
*Warning*: The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities. Just for caution. — igauravsehrawat, Jan 27 '15 at 05:39

score 3 · Answer 2 · answered Dec 15 '09 at 21:25

3

Do not reinvent the wheel. Use Amara toolkit. Variable names are just keys in a dictionary anyway. http://www.xml3k.org/Amara

answered Dec 15 '09 at 21:25

Hamish Grubijan

10,562
23
99
147

Another link - http://www.xml.com/pub/a/2005/01/19/amara.html You will end up with a variable doc, which has doc.id, which has doc.id.type[0], then doc.id.type[0].min, ... and so on. Super-easy to access! – Hamish Grubijan Dec 15 '09 at 21:33

score 2 · Answer 3 · answered Dec 15 '09 at 16:00

2

I would recommend using the minidom library.

The docs are pretty good so you should be up and running in no time.

Dan.

answered Dec 15 '09 at 16:00

freeasinbeer

29
2

score 2 · Accepted Answer · answered Dec 15 '09 at 16:38

As others have stated minidom is the way to go here. You open (and parse) the file, while going through the nodes you check if its relevant and should be read. That way, you also know if you want to read the child nodes.

Threw together this, seems to do what you want. Some of the values are read by attribute position rather than attribute name. And theres no error handling. And the print () at the end means its Python 3.x.

I'll leave it as an exercise to improve upon that, just wanted to post a snippet to get you started.

Happy hacking! :)

xml.txt

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>

parsexml.py

from xml.dom import minidom
data={}
doc=minidom.parse("xml.txt")
for n in doc.childNodes[0].childNodes:
    if n.localName=="id":
        id_name = n.attributes.item(0).nodeValue
        data[id_name] = {}
        for j in n.childNodes:
            if j.localName=="type":
                type_name = j.attributes.item(0).nodeValue
                data[id_name][type_name] = [(),()]
                for k in j.childNodes:
                    if k.localName=="min":
                        data[id_name][type_name][0] = \
                            (k.attributes.item(1).nodeValue, \
                             k.attributes.item(0).nodeValue)
                    if k.localName=="max":
                        data[id_name][type_name][1] = \
                            (k.attributes.item(1).nodeValue, \
                             k.attributes.item(0).nodeValue)
print (data)

Output:

{'X': {'A': [('100', '80'), ('200', '90')], 'B': [('100', '20'), ('20', '90')]}}

Sorry, wrong room. The fugly code competition is down the hall. — John Machin, Dec 15 '09 at 21:31

score 1 · Answer 5 · answered Dec 15 '09 at 16:04

1

Another XML parsing library: http://www.crummy.com/software/BeautifulSoup/

Parsing XML documentation starts here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20XML

answered Dec 15 '09 at 16:04

miku

181,842
47
306
310

Im more familiar with BeautifulSoup and parsing URLs than local XML files, so this is a great gapping solution for me. – Ben Keating Mar 08 '11 at 23:47

score 0 · Answer 6 · answered Dec 15 '09 at 16:02

0

Why not try something like the PyXml library. They have lots of documentation and tutorials.

answered Dec 15 '09 at 16:02

Gordon

4,823
4
38
55

3

**WARNING** Norwegian Blue Parrot syndrome: Last release 5 years ago. No Windows installers for Python 2.5 and 2.6. – John Machin Dec 16 '09 at 21:27

Parsing XML to a hash table

6 Answers6