
Getting a specific value based on the layout of an XML file is pretty straightforward. (See: StackOverflow)

But when I don't know the XML elements in advance, I can't recurse over the result, since xmltodict nests OrderedDicts inside OrderedDicts. Python reports these nested OrderedDicts as type 'unicode' rather than as OrderedDicts. Therefore, looping over them like this doesn't work:

def myprint(d):
    for k, v in d.iteritems():
        if isinstance(v, list):
            myprint(v)
        else:
            print "Key :{0},  Value: {1}".format(k, v)

What I basically want is to recurse over the whole XML file so that every key-value pair is shown. When the value of a key is another list of key-value pairs, it should recurse into that list.

With this xml-file as input:

<?xml version="1.0" encoding="utf-8"?>
<session id="2934" name="Valves" docVersion="5.0.1">
    <docInfo>
        <field name="Employee" isMandotory="True">Jake Roberts</field>
        <field name="Section" isOpen="True" isMandotory="False">5</field>
        <field name="Location" isOpen="True" isMandotory="False">Munchen</field>
    </docInfo>
</session>

and the code listed above, all the data under session ends up as the single value of the key session.

Example output:

Key :session,  Value: OrderedDict([(u'@id', u'2934'), (u'@name', u'Valves'), (u'@docVersion', u'5.0.1'), (u'docInfo', OrderedDict([(u'field', [OrderedDict([(u'@name', u'Employee'), (u'@isMandotory', u'True'), ('#text', u'Jake Roberts')]), OrderedDict([(u'@name', u'Section'), (u'@isOpen', u'True'), (u'@isMandotory', u'False'), ('#text', u'5')]), OrderedDict([(u'@name', u'Location'), (u'@isOpen', u'True'), (u'@isMandotory', u'False'), ('#text', u'Munchen')])])]))])

And this is obviously not what I want.

JakeRoberts

1 Answer

If you come across a list in the data then you just need to call myprint on every element of the list:

def myprint(d):
    if isinstance(d,dict): #check if it's a dict before using .iteritems()
        for k, v in d.iteritems():
            if isinstance(v, (list,dict)): #check for either list or dict
                myprint(v)
            else:
                print "Key :{0},  Value: {1}".format(k, v)
    elif isinstance(d,list): #allow for list input too
        for item in d:
            myprint(item)

Then you will get output something like:

...
Key :@name,  Value: Employee
Key :@isMandotory,  Value: True
Key :#text,  Value: Jake Roberts
Key :@name,  Value: Section
Key :@isOpen,  Value: True
Key :@isMandotory,  Value: False
Key :#text,  Value: 5
...
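For readers on Python 3, where `dict.iteritems()` and the `print` statement no longer exist, the same walk can be sketched roughly as below. The `sample` dict is hand-built to mimic the shape xmltodict produced for part of the question's XML (an assumption, it is not parsed from the file), and `collect_pairs` is a hypothetical name for a variant that returns the pairs instead of printing them directly.

```python
def collect_pairs(d):
    """Return (key, value) pairs for every leaf found in nested dicts and lists."""
    pairs = []
    if isinstance(d, dict):
        for k, v in d.items():
            if isinstance(v, (list, dict)):
                # descend into nested containers instead of printing them whole
                pairs.extend(collect_pairs(v))
            else:
                pairs.append((k, v))
    elif isinstance(d, list):
        for item in d:
            pairs.extend(collect_pairs(item))
    return pairs


# Hand-built stand-in for the xmltodict result (assumed shape, trimmed to two fields).
sample = {
    "session": {
        "@id": "2934",
        "docInfo": {
            "field": [
                {"@name": "Employee", "#text": "Jake Roberts"},
                {"@name": "Section", "#text": "5"},
            ]
        },
    }
}

for key, value in collect_pairs(sample):
    print("Key :{0},  Value: {1}".format(key, value))
```

Returning a list rather than printing inside the recursion makes the result easier to filter or test, at the cost of holding all pairs in memory.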

I'm not sure how useful this is, though, since you have a lot of duplicate keys like @name. Instead, I'd like to offer a function I wrote a while ago to traverse nested JSON data made up of dicts and lists:

def traverse(obj, prev_path = "obj", path_repr = "{}[{!r}]".format, yield_empty_lists_and_dicts=True):
    it = None
    if isinstance(obj,dict):
        it = obj.items()
    elif isinstance(obj,list):
        it = enumerate(obj)
    if it is None or (yield_empty_lists_and_dicts and len(obj)==0):
        yield prev_path,obj
        return
    for k,v in it:
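        # pass the flag along so nested calls honour yield_empty_lists_and_dicts too
        for data in traverse(v, path_repr(prev_path,k), path_repr, yield_empty_lists_and_dicts):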
            yield data

Then you can traverse the data with:

for path,value in traverse(doc):
    print("{} = {}".format(path,value))

With the default values for prev_path and path_repr, it gives output like this:

obj[u'session'][u'@id'] = 2934
obj[u'session'][u'@name'] = Valves
obj[u'session'][u'@docVersion'] = 5.0.1
obj[u'session'][u'docInfo'][u'field'][0][u'@name'] = Employee
obj[u'session'][u'docInfo'][u'field'][0][u'@isMandotory'] = True
obj[u'session'][u'docInfo'][u'field'][0]['#text'] = Jake Roberts
obj[u'session'][u'docInfo'][u'field'][1][u'@name'] = Section
obj[u'session'][u'docInfo'][u'field'][1][u'@isOpen'] = True
obj[u'session'][u'docInfo'][u'field'][1][u'@isMandotory'] = False
obj[u'session'][u'docInfo'][u'field'][1]['#text'] = 5
obj[u'session'][u'docInfo'][u'field'][2][u'@name'] = Location
obj[u'session'][u'docInfo'][u'field'][2][u'@isOpen'] = True
obj[u'session'][u'docInfo'][u'field'][2][u'@isMandotory'] = False
obj[u'session'][u'docInfo'][u'field'][2]['#text'] = Munchen

You can also supply your own function for path_repr. It receives the value of prev_path (itself built up by the earlier recursive calls to path_repr) and the new key. For example, a function that takes a tuple and appends one more element yields (tuple of indices, element) pairs, which is exactly the form the dict constructor accepts:

import pprint

def _tuple_concat(tup, idx):
    # tup + (idx,) works on Python 2 and 3; (*tup, idx) needs Python 3.5+
    return tup + (idx,)

def flatten_data(obj):
    """converts nested dict and list structure into a flat dictionary with tuple keys
    corresponding to the sequence of indices to reach particular element"""
    return dict(traverse(obj, (), _tuple_concat))

# doc is the parsed data, the same object passed to traverse() above
new_data = flatten_data(doc)
pprint.pprint(new_data)

which gives you the data in this dictionary format:

{('session', '@docVersion'): '5.0.1',
 ('session', '@id'): '2934',
 ('session', '@name'): 'Valves',
 ('session', 'docInfo', 'field', 0, '#text'): 'Jake Roberts',
 ('session', 'docInfo', 'field', 0, '@isMandotory'): 'True',
 ('session', 'docInfo', 'field', 0, '@name'): 'Employee',
 ('session', 'docInfo', 'field', 1, '#text'): '5',
 ('session', 'docInfo', 'field', 1, '@isMandotory'): 'False',
 ('session', 'docInfo', 'field', 1, '@isOpen'): 'True',
 ('session', 'docInfo', 'field', 1, '@name'): 'Section',
 ('session', 'docInfo', 'field', 2, '#text'): 'Munchen',
 ('session', 'docInfo', 'field', 2, '@isMandotory'): 'False',
 ('session', 'docInfo', 'field', 2, '@isOpen'): 'True',
 ('session', 'docInfo', 'field', 2, '@name'): 'Location'}
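Once the data is flat, picking out values is mostly a matter of filtering on the tuple keys. The sketch below is Python 3 and self-contained: it repeats a trimmed `traverse` (using `yield from` and a tuple-appending default for `path_repr`), uses a hand-built `sample` dict in place of real xmltodict output, and `text_values` is a hypothetical helper name, not part of any library.

```python
def traverse(obj, prev_path=(), path_repr=lambda path, key: path + (key,)):
    """Yield (path, leaf) pairs; empty dicts and lists are yielded as leaves."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        items = None
    if items is None or len(obj) == 0:
        yield prev_path, obj
        return
    for key, value in items:
        yield from traverse(value, path_repr(prev_path, key), path_repr)


def text_values(flat):
    """Return the values whose path ends in '#text', in insertion order."""
    return [value for path, value in flat.items() if path and path[-1] == "#text"]


# Hand-built stand-in for the xmltodict result (assumed shape, trimmed to two fields).
sample = {
    "session": {
        "@id": "2934",
        "docInfo": {
            "field": [
                {"@name": "Employee", "#text": "Jake Roberts"},
                {"@name": "Section", "#text": "5"},
            ]
        },
    }
}

flat = dict(traverse(sample))
print(flat[("session", "docInfo", "field", 1, "@name")])  # Section
print(text_values(flat))  # ['Jake Roberts', '5']
```

Because each key is the full path, the duplicate `@name` entries no longer collide, and the list index in the path tells you which `field` element a value came from.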

I found this particularly useful when dealing with my JSON data, but I'm not really sure what you want to do with your XML.

Tadhg McDonald-Jensen
  • Wow. That's super! Works like a charm. Especially the traverse function! Thank you very much! – JakeRoberts Apr 13 '16 at 07:48
  • This is wicked cool. I use it to create a pandas DataFrame so I can compare json and xml. – elmotec Apr 05 '19 at 16:17
  • with `pd.DataFrame.from_records(data=[tup for tup in traverse(xml_dict, root_name)], columns=['key', 'value'], index='key')` – elmotec Apr 05 '19 at 16:31