Reading XML that uses attributes

Question

A noob question on XML. I am using ElementTree to convert the XML to a df and eventually upload to a database. My problem is that the XML format is not standard. All ElementTree examples that I have seen use a different format. It looks something like this:

<session session_id="9">
    <party party_id="1" name="party 1">
        <member member_id="1" name="bob"/>
        <member member_id="2" name="simon"/>
    </party>
    <party party_id="2" name="party 2">
        <member member_id="3" name="diana"/>
        <member member_id="4" name="pablo"/>
    </party>
</session>

For one thing, all information is written in attributes, which is not a big issue because I can still fetch them. However I want to pick up the attributes not only of the member elements, but also of their parents. So the result should look something like this:

member_id	member_name	party_id	session_id
1	bob	1	9
2	simon	1	9
3	diana	2	9
4	pablo	2	9

I use children = list(root.iter()) to list all children and then append their attributes to a dataframe. However I lose the link to parent, so I cannot really say which party branch the member came from, so I cannot assign the right party_id.

I am wondering whether there is an easy way to get a dataframe out of this XML structure?

Welcome to Stack Overflow. Show us code in the form of a [mcve]. Regarding getting the parent, see https://stackoverflow.com/q/24239435/407651, https://stackoverflow.com/q/2170610/407651 — mzjn, Dec 08 '21 at 08:55
Find all the `party` tags and for each one iterate through their `member`, i.e. so you know the parent when looking at a child. Strangely enough there's a `findall()` methind in ElementTree, look at the documentation. — DisappointedByUnaccountableMod, Dec 08 '21 at 09:33
FYI there's nothing "non-standard" about your XML. It's just... xml. — DisappointedByUnaccountableMod, Dec 08 '21 at 09:52
Just because you, as a newbie, haven't seen XML like this before doesn't make it "non-standard". You really should be reconsidering your learning strategy: read about the technologies you are using to learn their concepts, rather than being taken by surprise every time you encounter something new. — Michael Kay, Dec 08 '21 at 10:58

Sven S · Answer 1 · 2021-12-08T13:11:58.363

0

Your xml is fully valid and well formed, but the example below indicates how xml is usually structured.

<parent attribute="supporting information">
    Information
    <child attribute="supporting information">
        Information
    </child>
</parent>

Alas, ElementTree does not provide a neat way of converting attributes to tag content/information.

Solution
You are half-way there (or possibly a third if we count the root element 'session'). You will have to iterate per xml-element level. (e.g. for every party element, iterate every member).

edited Dec 08 '21 at 13:11

answered Dec 08 '21 at 09:12

Sven S

11
4

The XML specification https://www.w3.org/TR/xml/ specifies the syntax of XML, and your example of "supporting information" and "information" is exactly that: an example i.e. it's not part of the specification, e.g. "supporting" isn't in the specification. So XML _doesn't_ specify the distinction you're making - it can be used that way or not that way - both are valid. – DisappointedByUnaccountableMod Dec 08 '21 at 10:07
Fair point @balmy, I'll edit the answer. – Sven S Dec 08 '21 at 13:09
As the OP's example perhaps shows, there may not be a "usual" way to structure XML, just different ways that are "usual" or suited for different data and purposes. – DisappointedByUnaccountableMod Dec 08 '21 at 13:28

score 0 · Accepted Answer · answered Dec 08 '21 at 11:35

something like this

import xml.etree.ElementTree as ET
import pandas as pd


xml = '''<session session_id="9">
    <party party_id="1" name="party 1">
        <member member_id="1" name="bob"/>
        <member member_id="2" name="simon"/>
    </party>
    <party party_id="2" name="party 2">
        <member member_id="3" name="diana"/>
        <member member_id="4" name="pablo"/>
    </party>
</session>'''

data = []
root = ET.fromstring(xml)
session_id = root.attrib['session_id']
for party in root.findall('party'):
  for memeber in party.findall('member'):
    data.append({'session_id':session_id,'party_id': party.attrib['party_id']})
    data[-1]['member_name'] = memeber.attrib['name']
    data[-1]['member_id'] = memeber.attrib['member_id']
df = pd.DataFrame(data)
print(df)

output

  session_id party_id member_name member_id
0          9        1         bob         1
1          9        1       simon         2
2          9        2       diana         3
3          9        2       pablo         4

Why to use `-1` index instead of appending element after all data collected? — Olvin Roght, Dec 08 '21 at 19:15

Reading XML that uses attributes

2 Answers2