1

I'm trying to extract some data from XML. I'm using xmltodict to load the data into a dictionary, then using list comprehensions to pull out individual parts into separate lists. I will later be plotting these using matplotlib.

XML:

<?xml version="1.0" ?>
<MYDATA>
<SESSION ID="1234">
    <INFO>
        <BEGIN LOAD="23"/>
    </INFO>
    <TRANSACTION ID="2103645570">
        <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="4315547431">
        <ANSWER>This is an answer</ANSWER>
    </TRANSACTION>
</SESSION>
<SESSION ID="5678">
    <INFO>
        <BEGIN LOAD="28"/>
    </INFO>
    <TRANSACTION ID="4099381642">
        <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="1220404184">
        <ANSWER>A Different answer</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="201506542">
        <ANSWER>Yet another one</ANSWER>
    </TRANSACTION>
</SESSION>
</MYDATA>

My code:

from collections import OrderedDict

# doc contains the xml exactly as loaded by xmltodict
doc = OrderedDict([(u'MYDATA', OrderedDict([(u'SESSION', [OrderedDict([(u'@ID', u'1234'), (u'INFO', OrderedDict([(u'BEGIN', OrderedDict([(u'@LOAD', u'23')]))])), (u'TRANSACTION', [OrderedDict([(u'@ID', u'2103645570'), (u'ANSWER', u'Hello')]), OrderedDict([(u'@ID', u'4315547431'), (u'ANSWER', u'This is an answer')])])]), OrderedDict([(u'@ID', u'5678'), (u'INFO', OrderedDict([(u'BEGIN', OrderedDict([(u'@LOAD', u'28')]))])), (u'TRANSACTION', [OrderedDict([(u'@ID', u'4099381642'), (u'ANSWER', u'Hello')]), OrderedDict([(u'@ID', u'1220404184'), (u'ANSWER', u'A Different answer')]), OrderedDict([(u'@ID', u'201506542'), (u'ANSWER', u'Yet another one')])])])])]))])

sess_ids = [i['@ID'] for i in doc['MYDATA']['SESSION']]
print sess_ids

sess_loads = [i['INFO']['BEGIN']['@LOAD'] for i in doc['MYDATA']['SESSION']]
print sess_loads

trans_ids = [[j['@ID'] for j in i['TRANSACTION']] for i in doc['MYDATA']['SESSION']]
print trans_ids

Output:

sess_ids:    [u'1234', u'5678']
sess_loads:  [u'23', u'28']
trans_ids:   [[u'2103645570', u'4315547431'], [u'4099381642', u'1220404184', u'201506542']]

You can see that I'm able to access the ID attributes from the SESSION elements and also the LOAD attributes from the BEGIN elements.

I need to get the ID attributes from the TRANSACTION elements as a single list. Currently I'm getting a list of lists in variable trans_ids.

How can I get just a flat list of the values?

I have tried:

[j['@ID'] for j in i['TRANSACTION'] for i in doc['MYDATA']['SESSION']]

but that just repeats the second session twice, giving:

[u'4099381642',
 u'4099381642',
 u'1220404184',
 u'1220404184',
 u'201506542',
 u'201506542']
Andy Madge
  • Is there any reason why you need to use a list comprehension? There’s nothing wrong with building the result list in more than a single line, maybe with a loop or something. – poke Sep 30 '13 at 16:55
  • Actually, no that's just the best option I've come up with so far. I'm open to better suggestions. – Andy Madge Sep 30 '13 at 17:45
  • I’m not saying it’s better; but if you struggle to get a list comprehension working, it’s certainly an easy way to get you to the result. And it might be more readable than a long one-liner too. – poke Sep 30 '13 at 17:47
  • I am looking for something fairly compact though. This was a cut down example - on the real thing I'm extracting about 30 different attributes at different depths of the XML tree. – Andy Madge Sep 30 '13 at 18:02

3 Answers

2

Is there a reason you need to go to a dictionary? This sort of thing is fairly straightforward if you query the XML directly:

import xml.etree.ElementTree as etree
txml = etree.fromstring(xml_string)  # xml_string: the XML above as a string; etree.parse() expects a file name or file object instead
txml.findall('SESSION/TRANSACTION')
[<Element TRANSACTION at 0x4064f9d8>,
 <Element TRANSACTION at 0x4064fa20>,
 <Element TRANSACTION at 0x4064f990>,
 <Element TRANSACTION at 0x4064fa68>,
 <Element TRANSACTION at 0x4064fab0>]
[x.get('ID') for x in txml.findall('SESSION/TRANSACTION')]
['2103645570', '4315547431', '4099381642', '1220404184', '201506542']

At least, it seems more compact to me.
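
A self-contained sketch of the same idea, assuming the XML is held in a string. The variable names (`xml_string`, `root`) are illustrative and not from the original post. The same `findall` path syntax also reaches the SESSION and BEGIN attributes, which is one way to keep extraction compact when values sit at different depths:

```python
import xml.etree.ElementTree as etree

# xml_string is a hypothetical name; the content is the XML from the question.
xml_string = """<?xml version="1.0" ?>
<MYDATA>
<SESSION ID="1234">
    <INFO>
        <BEGIN LOAD="23"/>
    </INFO>
    <TRANSACTION ID="2103645570">
        <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="4315547431">
        <ANSWER>This is an answer</ANSWER>
    </TRANSACTION>
</SESSION>
<SESSION ID="5678">
    <INFO>
        <BEGIN LOAD="28"/>
    </INFO>
    <TRANSACTION ID="4099381642">
        <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="1220404184">
        <ANSWER>A Different answer</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="201506542">
        <ANSWER>Yet another one</ANSWER>
    </TRANSACTION>
</SESSION>
</MYDATA>"""

# fromstring() returns the root element (MYDATA); paths are relative to it.
root = etree.fromstring(xml_string)

sess_ids = [s.get('ID') for s in root.findall('SESSION')]
sess_loads = [b.get('LOAD') for b in root.findall('SESSION/INFO/BEGIN')]
trans_ids = [t.get('ID') for t in root.findall('SESSION/TRANSACTION')]

print(sess_ids)
print(sess_loads)
print(trans_ids)
```

Because `findall` returns matches in document order, `trans_ids` comes out as one flat list with no extra flattening step.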

Corley Brigman
1

I have tried:

[j['@ID'] for j in i['TRANSACTION'] for i in doc['MYDATA']['SESSION']]

You nearly had it. Just reverse the inner for..in parts:

>>> [j['@ID'] for i in doc['MYDATA']['SESSION'] for j in i['TRANSACTION']]
[u'2103645570', u'4315547431', u'4099381642', u'1220404184', u'201506542']

To understand this, take a look at this example:

>>> a = [[1, 2, 3], [4, 5, 6]]
>>> [j for j in i for i in a]
[4, 4, 5, 5, 6, 6]
>>> [j for i in a for j in i]
[1, 2, 3, 4, 5, 6]

When there are multiple for..in parts in a list comprehension, they are evaluated from left to right, so the leftmost part is the outermost loop. So if your loop would look like this:

for i in a:
    for j in i:
        j

Then you have to specify it in the same order, instead of from inner to outer:

[j for i in a for j in i]
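
As a small runnable check of that equivalence (the names `row`, `item`, `flat_loop` and `flat_comp` are illustrative, not from the answer), the explicit loops and the comprehension can be compared side by side:

```python
a = [[1, 2, 3], [4, 5, 6]]

# Explicit nested loops: outer loop first, inner loop second.
flat_loop = []
for row in a:
    for item in row:
        flat_loop.append(item)

# The comprehension lists its for-clauses in the same outer-to-inner order.
flat_comp = [item for row in a for item in row]

print(flat_loop)  # [1, 2, 3, 4, 5, 6]
print(flat_comp)  # [1, 2, 3, 4, 5, 6]
```

The doubled output in the question is consistent with Python 2 leaking the comprehension variable: `i` was still bound to the last session from an earlier comprehension, so the inner-to-outer ordering ran without error. In Python 3 the loop variable does not leak, so that ordering raises a NameError unless a name `i` happens to exist in the enclosing scope.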
poke
  • Aha that's what I was missing - I assumed it evaluated right to left and I haven't seen the order explained anywhere. Thanks. – Andy Madge Sep 30 '13 at 17:49
  • Strictly speaking this is the correct answer to my question, but @corley-brigman's answer is actually a better solution to my particular problem. – Andy Madge Oct 01 '13 at 15:13
0
from itertools import chain
list(chain(*trans_ids))
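
A self-contained sketch of this approach, assuming `trans_ids` is the list of lists produced by the code in the question (the values are copied here so the snippet runs on its own). `chain.from_iterable` is a variant that takes the outer list directly, so no `*` is needed:

```python
from itertools import chain

# The nested result from the question's comprehension, copied for illustration.
trans_ids = [['2103645570', '4315547431'],
             ['4099381642', '1220404184', '201506542']]

# chain(*trans_ids) passes each inner list as a separate argument.
flat_star = list(chain(*trans_ids))

# chain.from_iterable(trans_ids) iterates the outer list itself.
flat_iter = list(chain.from_iterable(trans_ids))

print(flat_star)
print(flat_iter)
```

Both calls flatten exactly one level of nesting, which is all this data needs.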
snahor
  • Can you elaborate? Are you saying I do that after the code I already have? What is the * for? – Andy Madge Sep 30 '13 at 17:51
  • @AndyMadge The `*` [unpacks](http://docs.python.org/2/tutorial/controlflow.html#unpacking-argument-lists) the list and supplies its elements as arguments to the function. See also [this question](http://stackoverflow.com/q/2921847/216074). – poke Oct 01 '13 at 15:26
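
To illustrate the unpacking described in that comment, here is a minimal sketch using a made-up function (`add3` and `args` are hypothetical names chosen for this example):

```python
def add3(x, y, z):
    """Return the sum of three positional arguments."""
    return x + y + z

args = [1, 2, 3]

# add3(*args) is equivalent to add3(1, 2, 3): each list element
# becomes one positional argument.
total = add3(*args)
print(total)  # 6
```

In the same way, calling `chain` with `*` on a list of two inner lists hands it two separate arguments, one per inner list.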