5

I have an XML file of the form:

<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>

I need to process it so that, for instance, when the user inputs `nd`, the program matches it against the <Phonetic> tag and returns `and` from the corresponding <Phonemic> tag. I thought that if I could convert the XML file to a dictionary, I would be able to iterate over the data and find information when needed.

I searched and found xmltodict, which is meant for this purpose:

import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())

Running this gives me an ordered dict:

>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])

Unfortunately, this hasn't made things simpler, and I am not sure how to go about implementing the program with the new data structure. For example, to access `nd` I'd have to write:

obj['NewDataSet']['Root'][0]['Phonetic']

which is ridiculously complicated. I tried converting it to a regular dictionary with dict(), but because the structure is nested, the inner layers remain OrderedDicts, and my data set is very large.

Anshul Goyal
Omid
  • 1
    How would converting to a regular dictionary make any difference? You will still have as many layers of keys. What *exactly* is the problem; do you just not like the `OrderedDict.__repr__`? – jonrsharpe Nov 14 '14 at 09:16

3 Answers

6

If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right.

Instead, you can do the following:

obj = obj["NewDataSet"]
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]] 
# Above step ensures that root_elements is always a list
for element in root_elements:
    print element["Phonetic"]

Even though this code looks longer, the advantage is that it will be a lot more compact and modular once you start dealing with sufficiently large XML.
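To get from this loop to the lookup described in the question (enter a Phonetic value, get back the Phonemic one), the loop can fill an index keyed by the phonetic form. The following is only a sketch: the `parsed` literal is a hand-written stand-in for what xmltodict returns for the sample, and `build_phonetic_index` is a hypothetical helper name, not part of xmltodict. It also assumes that a file with a single <Root> yields one mapping rather than a list.

```python
def build_phonetic_index(parsed):
    """Map each Phonetic value to its Phonemic value."""
    roots = parsed["NewDataSet"]["Root"]
    # Assumption: a lone <Root> arrives as a single mapping, not a list.
    if not isinstance(roots, list):
        roots = [roots]
    return {root["Phonetic"]: root["Phonemic"] for root in roots}


# Hand-written stand-in for the result of xmltodict.parse() on the sample.
parsed = {
    "NewDataSet": {
        "Root": [
            {"Phonemic": "and", "Phonetic": "nd", "Description": None,
             "Start": "0", "End": "8262"},
            {"Phonemic": "comfortable", "Phonetic": "comfetebl",
             "Description": "adj", "Start": "61404", "End": "72624"},
        ]
    }
}

index = build_phonetic_index(parsed)
print(index.get("nd"))  # and
```

Building the index once makes each later lookup a plain dictionary access. If two entries share the same Phonetic value, the later one overwrites the earlier one in this sketch.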

PS: I had the same issues with xmltodict. But compared with using xml.etree.ElementTree to parse the XML files, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with the other inanities of the xml module.
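For comparison, here is roughly what the same lookup might look like with xml.etree.ElementTree from the standard library. This is a sketch against the sample from the question, not a recommendation of one parser over the other; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

# fromstring() returns the top-level <NewDataSet> element.
dataset = ET.fromstring(xmldata)

# findall() always returns a list, even when there is only one <Root>,
# so no single-element special case is needed here.
phonetic_to_phonemic = {
    root.findtext("Phonetic"): root.findtext("Phonemic")
    for root in dataset.findall("Root")
}

print(phonetic_to_phonemic["comfetebl"])  # comfortable
```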

EDIT

The following code works for me:

import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]] 
# Above step ensures that root_elements is always a list
for element in root_elements:
    print element["Phonetic"]
Anshul Goyal
  • Thanks. I think the last line should be `print element[0]['Phonemic']` otherwise it would complain that the indices should be integers not `str`. – Omid Nov 14 '14 at 09:31
  • @novice66 no it won't be, reason being the index is taken care of because of the for loop I use. Did you face any issues trying out the code? – Anshul Goyal Nov 14 '14 at 09:32
  • 1
    I just ran it (in Python 3, having added the parentheses around `print`) and I got the error: `TypeError: list indices must be integers, not str` – Omid Nov 14 '14 at 09:34
  • @novice66 Check edits. I am on python 2, so that may be causing it. – Anshul Goyal Nov 14 '14 at 09:38
  • With your last statement, the comparison to `etree`, do you mean that `xmltodict` is slimmer in terms of its code and therefore easier to handle than others that are more bloated? – Timo Jul 01 '21 at 19:34
6

You can actually avoid the conversion to OrderedDict by setting an additional keyword parameter:

obj = xmltodict.parse(xmldata, dict_constructor=dict)

parse forwards its keyword arguments to _DictSAXHandler, where dict_constructor is set to OrderedDict by default.
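If the data has already been parsed into nested OrderedDicts, calling dict() on the result converts only the outermost layer, which is what the question ran into. A recursive conversion is one way around that. The sketch below uses a hypothetical helper named `to_plain` and a shortened, hand-written stand-in for the parsed sample.

```python
from collections import OrderedDict


def to_plain(node):
    """Recursively replace mappings with plain dicts; keep lists and leaves."""
    if isinstance(node, dict):  # OrderedDict is a dict subclass
        return {key: to_plain(value) for key, value in node.items()}
    if isinstance(node, list):
        return [to_plain(item) for item in node]
    return node


# Shortened stand-in for the nested structure shown in the question.
nested = OrderedDict([("NewDataSet", OrderedDict([("Root", [
    OrderedDict([("Phonemic", "and"), ("Phonetic", "nd")]),
    OrderedDict([("Phonemic", "comfortable"), ("Phonetic", "comfetebl")]),
])]))])

plain = to_plain(nested)
print(type(plain["NewDataSet"]["Root"][0]) is dict)  # True
```

The conversion changes only the container types; the keys needed to reach a value stay exactly the same.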

mome
0

Mu's answer worked for me; the only thing I had to change was the tricky step that ensures root_elements is always a list:

import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
# If obj["Root"] is already a list, use it as is;
# otherwise wrap the single element in a one-item list.
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
for element in root_elements:
    # Parenthesized print works on both Python 2 and Python 3.
    print(element["Phonetic"])