I have an XML file like:

<plays format="tokens">
    <period number="1">
      <play/>
      <play/>
      <play/>
    </period>
    <period number="2">
      <play/>
      <play/>
      <play/>
    </period>
</plays>

Each play tag contains a number of attributes, and I would also like to attach the period number to each play. My goal is to produce a table with one row per play, containing the play's attributes plus a column that says which period that play occurred in (1 or 2).

My current code to flatten the plays out is:

d = []
for play in root.iter('play'):
    d.append(play.attrib)
    
df = pd.DataFrame(d)

This gives me every play and their attributes in the table df, but the period is not currently included in this table. Any direction would help, thank you!

bblackburn

1 Answer

You can do this with ElementTree, as shown below:

plays.xml

<plays format="tokens">
    <period number="1">
      <play attr1="abc" attr2="ddd"/>
      <play attr1="cbc" attr2="ddd"/>
      <play attr1="dbc" attr2="ddd"/>
    </period>
    <period number="2">
      <play attr1="abc" attr2="ddd"/>
      <play attr1="dbc" attr2="ddd"/>
      <play attr1="kbc" attr2="ddd" />
    </period>
</plays>

main.py

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('plays.xml')
root = tree.getroot()

# iterate over each period, then over the plays nested inside it,
# so every play can be tagged with its period number
periods = []
for period in root.iter('period'):
    number = period.attrib['number']
    for play in period.iter('play'):
        other_attr = play.attrib
        # merge the period's number with the play's own attributes (attr1, attr2)
        # into one flat dictionary; see https://stackoverflow.com/a/62820532/1138192
        periods.append({**{"number": number}, **other_attr})

df = pd.DataFrame(periods)
print(df)

Output:

  number attr1 attr2
0      1   abc   ddd
1      1   cbc   ddd
2      1   dbc   ddd
3      2   abc   ddd
4      2   dbc   ddd
5      2   kbc   ddd
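
One thing to note about the output above: the number column holds strings, because XML attribute values are always text. Here is a minimal self-contained sketch of the same approach, assuming the sample XML is inlined as a string so it runs without a file on disk. The names XML_TEXT and plays_to_frame are hypothetical, introduced only for illustration. It also casts the period number to an integer, which may or may not be what you want.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Sample data inlined so the sketch runs without plays.xml
# (assumption: same shape as the file shown above)
XML_TEXT = """
<plays format="tokens">
    <period number="1">
      <play attr1="abc" attr2="ddd"/>
      <play attr1="cbc" attr2="ddd"/>
    </period>
    <period number="2">
      <play attr1="kbc" attr2="ddd"/>
    </period>
</plays>
"""


def plays_to_frame(xml_text):
    """Return one row per <play>, tagged with its parent period number."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for period in root.iter("period"):
        number = period.attrib["number"]
        # findall("play") only matches direct children of this period
        for play in period.findall("play"):
            rows.append({"number": number, **play.attrib})
    df = pd.DataFrame(rows)
    # XML attribute values are strings; cast so the column compares numerically
    df["number"] = df["number"].astype(int)
    return df


df = plays_to_frame(XML_TEXT)
print(df)
```

With the cast in place, a filter such as df[df["number"] == 2] compares against an integer rather than the string "2".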
A l w a y s S u n n y
  • It worked thank you so much! Do you think you could explain exactly what you're doing in the 'periods.append({**{"number": number}, **other_attr})' line just so I can learn? Thanks again! – bblackburn Dec 13 '22 at 19:17
  • @bblackburn I added a comment above that line. To summarize: we merge the play element's attributes, which are a dictionary such as `{"attr1":"abc", "attr2":"ddd"}`, with the period's attributes, which are also a dictionary, `{"number":"1"}`. Writing `{**x, **y}` merges both dictionaries into a single flat one, so the result looks like `{'number': '1', 'attr1': 'abc', 'attr2': 'ddd'}`, and `periods.append({**x, **y})` appends it to the list. Later we can build the pandas DataFrame from that list easily. – A l w a y s S u n n y Dec 13 '22 at 19:26
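
The merge described in the comment above can be tried on its own, without any XML. This is a small sketch using plain dictionaries; the variable names period_attrs, play_attrs, merged, and clash are made up for illustration. It also shows one behaviour worth knowing: if both dictionaries share a key, the value from the right-most one is kept.

```python
# Attribute dictionaries shaped like the ones ElementTree returns (values are strings)
period_attrs = {"number": "1"}
play_attrs = {"attr1": "abc", "attr2": "ddd"}

# ** unpacks each dictionary's key/value pairs into the new dictionary literal
merged = {**period_attrs, **play_attrs}
print(merged)  # {'number': '1', 'attr1': 'abc', 'attr2': 'ddd'}

# When both sides define the same key, the right-most value wins
clash = {**{"number": "1"}, **{"number": "99", "attr1": "abc"}}
print(clash)  # {'number': '99', 'attr1': 'abc'}
```

So if a play element ever carried its own number attribute, it would overwrite the period number in the merged row; swapping the order of the two dictionaries would make the period's value win instead.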