4

I have an XML file in the following format:

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>

<type...>
</type>
</doc>

I would like to parse this document and build a hash table

{X: {"A": [(100,80), (200,90)], "B": [(100,20), (20,90)]}, Y: .....} 

How would I do this in Python?

Gilles 'SO- stop being evil'
  • 104,111
  • 38
  • 209
  • 254
user231536
  • 2,661
  • 4
  • 33
  • 45
  • This kind of question has been asked a couple times. The answers may be able to help you out. http://stackoverflow.com/questions/191536/converting-xml-to-json-using-python http://stackoverflow.com/questions/471946/how-to-convert-xml-to-json-in-python – Thomas Dec 15 '09 at 16:04

6 Answers6

12

I disagree with the suggestion in other answers to use minidom -- that's a so-so Python adaptation of a standard originally conceived for other languages, usable but not a great fit. The recommended approach in modern Python is ElementTree.

The same interface is also implemented, faster, in third party module lxml, but unless you need blazing speed the version included with the Python standard library is fine (and faster than minidom anyway) -- the key point is to program to that interface, then you can always switch to a different implementation of the same interface in the future if you want to, with minimal changes to your own code.

For example, after the needed imports &c, the following code is a minimal implementation of your example (it does not verify that the XML is correct, just extracts the data assuming correctness -- adding various kinds of checks is pretty easy of course):

from xml.etree import ElementTree as et  # or, import any other, faster version of ET

def xml2data(xmlfile):
  tree = et.parse(xmlfile)
  data = {}
  for anid in tree.getroot().getchildren():
    currdict = data[anid.get('name')] = {}
    for atype in anid.getchildren():
      currlist = currdict[atype.get('name')] = []
      for c in atype.getchildren():
        currlist.append((c.get('val'), c.get('id')))
  return data

This produces your desired result given your sample input.

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • `for child in node.getchildren():` is unnecessary; use `for child in node:` instead. – John Machin Dec 16 '09 at 21:27
  • *Warning*: The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities. Just for caution. – igauravsehrawat Jan 27 '15 at 05:39
3

Do not reinvent the wheel. Use Amara toolkit. Variable names are just keys in a dictionary anyway. http://www.xml3k.org/Amara

Hamish Grubijan
  • 10,562
  • 23
  • 99
  • 147
  • Another link - http://www.xml.com/pub/a/2005/01/19/amara.html You will end up with a variable doc, which has doc.id, which has doc.id.type[0], then doc.id.type[0].min, ... and so on. Super-easy to access! – Hamish Grubijan Dec 15 '09 at 21:33
2

I would recommend using the minidom library.

The docs are pretty good so you should be up and running in no time.

Dan.

2

As others have stated minidom is the way to go here. You open (and parse) the file, while going through the nodes you check if its relevant and should be read. That way, you also know if you want to read the child nodes.

Threw together this, seems to do what you want. Some of the values are read by attribute position rather than attribute name. And theres no error handling. And the print () at the end means its Python 3.x.

I'll leave it as an exercise to improve upon that, just wanted to post a snippet to get you started.

Happy hacking! :)

xml.txt

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>

parsexml.py

from xml.dom import minidom
data={}
doc=minidom.parse("xml.txt")
for n in doc.childNodes[0].childNodes:
    if n.localName=="id":
        id_name = n.attributes.item(0).nodeValue
        data[id_name] = {}
        for j in n.childNodes:
            if j.localName=="type":
                type_name = j.attributes.item(0).nodeValue
                data[id_name][type_name] = [(),()]
                for k in j.childNodes:
                    if k.localName=="min":
                        data[id_name][type_name][0] = \
                            (k.attributes.item(1).nodeValue, \
                             k.attributes.item(0).nodeValue)
                    if k.localName=="max":
                        data[id_name][type_name][1] = \
                            (k.attributes.item(1).nodeValue, \
                             k.attributes.item(0).nodeValue)
print (data)

Output:

{'X': {'A': [('100', '80'), ('200', '90')], 'B': [('100', '20'), ('20', '90')]}}
Mizipzor
  • 51,151
  • 22
  • 97
  • 138
1

Another XML parsing library: http://www.crummy.com/software/BeautifulSoup/

miku
  • 181,842
  • 47
  • 306
  • 310
  • Im more familiar with BeautifulSoup and parsing URLs than local XML files, so this is a great gapping solution for me. – Ben Keating Mar 08 '11 at 23:47
0

Why not try something like the PyXml library. They have lots of documentation and tutorials.

Gordon
  • 4,823
  • 4
  • 38
  • 55
  • 3
    **WARNING** Norwegian Blue Parrot syndrome: Last release 5 years ago. No Windows installers for Python 2.5 and 2.6. – John Machin Dec 16 '09 at 21:27