23

I have an XML file which looks like this:

<encspot>
  <file>
   <Name>some filename.mp3</Name>
   <Encoder>Gogo (after 3.0)</Encoder>
   <Bitrate>131</Bitrate>
   <Mode>joint stereo</Mode>
   <Length>00:02:43</Length>
   <Size>5,236,644</Size>
   <Frame>no</Frame>
   <Quality>good</Quality>
   <Freq.>44100</Freq.>
   <Frames>6255</Frames>
   ..... and so forth ......
  </file>
  <file>....</file>
</encspot>

I want to read it into a python object, something like a list of dictionaries. Because the markup is absolutely fixed, I'm tempted to use regex (I'm quite good at using those). However, I thought I'll check if someone knows how to easily avoid regexes here. I don't have much experience with SAX or other parsing, though, but I'm willing to learn.

I'm looking forward to be shown how this is done quickly without regexes in Python. Thanks for your help!

Felix Dombek
  • 13,664
  • 17
  • 79
  • 131
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 I know you're talking about XML here but the key message that XML "is not a regular language and hence cannot be parsed by regular expressions" remains. Use objectify or cElementTree like answerers suggest. – Stephen Paulger Apr 03 '11 at 21:13
  • 2
    Actually, my given subset of XML is indeed regular. I only proposed to use regex because it is fixed as it is. – Felix Dombek Apr 04 '11 at 12:36
  • 3
    You may look into http://lxml.de/objectify.html or something like http://www.freenet.org.nz/python/xmlobject/ There are lots of variations of XML parsers - check with PyPI. And: forget writing parsers using regular expressions - only fools that don't know better do that. –  Apr 03 '11 at 16:33
  • Thanks! `lxml objectify` seemed a bit difficult at first glance, but I ended up with quite a good result using `xmlobject` (only that I had to manually change the source code to 3.x syntax) ... however I will test the `elementtree` suggestion if I find time and then decide which answer I'm going to accept – Felix Dombek Apr 04 '11 at 12:27
  • Possible duplicate of [How to convert XML to objects?](https://stackoverflow.com/questions/418497/how-to-convert-xml-to-objects) – Stevoisiak Feb 13 '18 at 15:28

5 Answers5

32

My beloved SD Chargers hat is off to you if you think a regex is easier than this:

#!/usr/bin/env python
import xml.etree.cElementTree as et

sxml="""
<encspot>
  <file>
   <Name>some filename.mp3</Name>
   <Encoder>Gogo (after 3.0)</Encoder>
   <Bitrate>131</Bitrate>
  </file>
  <file>
   <Name>another filename.mp3</Name>
   <Encoder>iTunes</Encoder>
   <Bitrate>128</Bitrate>  
  </file>
</encspot>
"""
tree=et.fromstring(sxml)

for el in tree.findall('file'):
    print '-------------------'
    for ch in el.getchildren():
        print '{:>15}: {:<30}'.format(ch.tag, ch.text) 

print "\nan alternate way:"  
el=tree.find('file[2]/Name')  # xpath
print '{:>15}: {:<30}'.format(el.tag, el.text)  

Output:

-------------------
           Name: some filename.mp3             
        Encoder: Gogo (after 3.0)              
        Bitrate: 131                           
-------------------
           Name: another filename.mp3          
        Encoder: iTunes                        
        Bitrate: 128                           

an alternate way:
           Name: another filename.mp3  

If your attraction to a regex is being terse, here is an equally incomprehensible bit of list comprehension to create a data structure:

[(ch.tag,ch.text) for e in tree.findall('file') for ch in e.getchildren()]

Which creates a list of tuples of the XML children of <file> in document order:

[('Name', 'some filename.mp3'), 
 ('Encoder', 'Gogo (after 3.0)'), 
 ('Bitrate', '131'), 
 ('Name', 'another filename.mp3'), 
 ('Encoder', 'iTunes'), 
 ('Bitrate', '128')]

With a few more lines and a little more thought, obviously, you can create any data structure that you want from XML with ElementTree. It is part of the Python distribution.

Edit

Code golf is on!

[{item.tag: item.text for item in ch} for ch in tree.findall('file')] 
[ {'Bitrate': '131', 
   'Name': 'some filename.mp3', 
   'Encoder': 'Gogo (after 3.0)'}, 
  {'Bitrate': '128', 
   'Name': 'another filename.mp3', 
   'Encoder': 'iTunes'}]

If your XML only has the file section, you can choose your golf. If your XML has other tags, other sections, you need to account for the section the children are in and you will need to use findall

There is a tutorial on ElementTree at Effbot.org

the wolf
  • 34,510
  • 13
  • 53
  • 71
  • 3
    Take a penalty stroke for using dict comprehension (2.7+) – John Machin Apr 05 '11 at 07:00
  • 1
    Looks like this is deprecated as of Python 3.3: https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree :( – Newbyte Apr 27 '21 at 08:14
  • Use list(el) or an iterator as 'for ch in el:' as getchildren() is deprecated. For more infos check: https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.getchildren – Flavio Caduff Oct 24 '21 at 21:31
8

Use ElementTree. You don't need/want to muck about with a parse-only gadget like pyexpat ... you'd only end up re-inventing ElementTree partially and poorly.

Another possibility is lxml which is a third-party package which implements the ElementTree interface plus more.

Update Someone started playing code-golf; here's my entry, which actually creates the data structure you asked for:

# xs = """<encspot> etc etc </encspot"""
>>> import xml.etree.cElementTree as et
>>> from pprint import pprint as pp
>>> pp([dict((attr.tag, attr.text) for attr in el) for el in et.fromstring(xs)])
[{'Bitrate': '131',
  'Encoder': 'Gogo (after 3.0)',
  'Frame': 'no',
  'Frames': '6255',
  'Freq.': '44100',
  'Length': '00:02:43',
  'Mode': 'joint stereo',
  'Name': 'some filename.mp3',
  'Quality': 'good',
  'Size': '5,236,644'},
 {'Bitrate': '0', 'Name': 'foo.mp3'}]
>>>

You'd probably want to have a dict mapping "attribute" names to conversion functions:

converters = {
    'Frames': int,
    'Size': lambda x: int(x.replace(',', '')),
    # etc
    }
John Machin
  • 81,303
  • 11
  • 141
  • 189
  • You 'code golf' dict would convert all children of `` correct? You also loose item order if it is meaningful – dawg Apr 05 '11 at 04:40
  • @drewk: (1) According to the OP, the only children of "encspot" are "file" elements -- if not so, then insert `if el.tag == "file"` at the appropriate place. (2) lose item order? The OP wanted a dictionary. It looks like there's 0 or 1 of each kind of item with the order being irrelevant. – John Machin Apr 05 '11 at 06:52
  • @drewk: If you mean the `converters` dict, then yes, you'd convert everything, including strings -- e.g. `x.strip()` – John Machin Apr 05 '11 at 06:57
  • Sometimes document order is relevant in XML. The user just needs to know his data and Python idiom I guess. Thanks for the response... – dawg Apr 05 '11 at 20:56
6

I have also been looking for a simple way to transform data between XML documents and Python data structures, something similar to Golang's XML library which allows you to declaratively specify how to map from data structures to XML.

I was unable to find such a library for Python, so I wrote one to meet my need called declxml for declarative XML processing.

With declxml, you create processors which declaratively define the structure of your XML document. Processors are used to perform both parsing and serialization as well as a basic level of validation.

Parsing this XML data into a list of dictionaries with declxml is straightforward

import declxml as xml

xml_string = """
<encspot>
  <file>
   <Name>some filename.mp3</Name>
   <Encoder>Gogo (after 3.0)</Encoder>
   <Bitrate>131</Bitrate>
  </file>
  <file>
   <Name>another filename.mp3</Name>
   <Encoder>iTunes</Encoder>
   <Bitrate>128</Bitrate>  
  </file>
</encspot>
"""

processor = xml.dictionary('encspot', [
    xml.array(xml.dictionary('file', [
        xml.string('Name'),
        xml.string('Encoder'),
        xml.integer('Bitrate')
    ]), alias='files')
])

xml.parse_from_string(processor, xml_string)

Which produces the following result

{'files': [
  {'Bitrate': 131, 'Encoder': 'Gogo (after 3.0)', 'Name': 'some filename.mp3'},
  {'Bitrate': 128, 'Encoder': 'iTunes', 'Name': 'another filename.mp3'}
]}

Want to parse the data into objects instead of dictionaries? You can do that as well

import declxml as xml

class AudioFile:

    def __init__(self):
        self.name = None
        self.encoder = None
        self.bit_rate = None

    def __repr__(self):
        return 'AudioFile(name={}, encoder={}, bit_rate={})'.format(
            self.name, self.encoder, self.bit_rate)


processor = xml.array(xml.user_object('file', AudioFile, [
    xml.string('Name', alias='name'),
    xml.string('Encoder', alias='encoder'),
    xml.integer('Bitrate', alias='bit_rate')
]), nested='encspot')

xml.parse_from_string(processor, xml_string)

Which produces the output

[AudioFile(name=some filename.mp3, encoder=Gogo (after 3.0), bit_rate=131),
 AudioFile(name=another filename.mp3, encoder=iTunes, bit_rate=128)]
gatkin
  • 561
  • 5
  • 5
1

This code is from my teacher's Github. It converts XML string to Python object. The advantage of this approach is that it works on any XML.

Implement logic:

def xml2py(node):
    """
    convert xml to python object
    node: xml.etree.ElementTree object
    """

    name = node.tag

    pytype = type(name, (object, ), {})
    pyobj = pytype()

    for attr in node.attrib.keys():
        setattr(pyobj, attr, node.get(attr))

    if node.text and node.text != '' and node.text != ' ' and node.text != '\n':
        setattr(pyobj, 'text', node.text)

    for cn in node:
        if not hasattr(pyobj, cn.tag):
            setattr(pyobj, cn.tag, [])
        getattr(pyobj, cn.tag).append(xml2py(cn))

    return pyobj

Define data:

xml_str = '''<?xml version="1.0" encoding="UTF-8"?>
<menza>
    <date day="Monday">
        <meal name="Potato flat cakes">
            <ingredient name="potatoes"/>
            <ingredient name="flour"/>
            <ingredient name="eggs"/>
        </meal>
        <meal name="Pancakes">
            <ingredient name="milk"/>
            <ingredient name="flour"/>
            <ingredient name="eggs"/>
        </meal>
    </date>
</menza>'''

Load object from XML:

import xml.etree.ElementTree as ET
menza_xml_tree = ET.fromstring(xml_str)
obj = xml2py(menza_xml_tree)

Test:

for date in obj.date:
    print(date.day)
    for meal in date.meal:
        print('\t', meal.name)
        for ingredient in meal.ingredient:
            print('\t\t', ingredient.name)
0

If you have a static function which converts a XML to Object it would be something like this

@classmethod   
def from_xml(self,xml_str):
    #create XML Element
    root = ET.fromstring(xml_str)
    # create a dict from it
    d = {ch.tag: ch.text for ch in root.getchildren()}
    # return the object, created with **kwargs called from the Class, that's why its classmethod
    return self(**d)
gneric
  • 3,457
  • 1
  • 17
  • 30