I am trying to create Pandas dataframe out of XML. The XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Products>
<Info>
<Msg>Shop items.</Msg>
</Info>
<shop shopNr="01">
<ItemNr>1001</ItemNr>
<ItemNr>1002</ItemNr>
<ItemNr>1003</ItemNr>
<ItemNr>1004</ItemNr>
<ItemNr>1010</ItemNr>
</shop>
<shop shopNr="02">
<ItemNr>1002</ItemNr>
<ItemNr>1006</ItemNr>
<ItemNr>1005</ItemNr>
</shop>
<shop shopNr="03">
<ItemNr>1009</ItemNr>
<ItemNr>1006</ItemNr>
<ItemNr>1005</ItemNr>
<ItemNr>1002</ItemNr>
</shop>
</Products>
I have tried using XML Etree as the code below. I have two problems.
First, I can not get the values of ItemNr as children of the root. Instead of getting the value, ie. 1001, I get
<Element 'ItemNr' at 0x000001E2D6C41B38>.
The second problem is when I am creating a dataframe out of the lists. I end up having a list of lists in items. Although the result is empty now since I could not get the values above, I want to end up with a flattened list.
import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops = []
items = []
for node in root.iter('shop'):
shops.append(node.attrib.get('shopNr'))
items.append(list(node))
d = {'shops': shops, 'items': items}
df = pd.DataFrame(d)
The DataFrame produced.
shops items
0 01 [[], [], [], [], []]
1 02 [[], [], []]
2 03 [[], [], [], []]
desired output is:
shops items
0 01 [1001, 1002, 1003, 1004, 1010]
1 02 [1002, 1006, 1005]
2 03 [1009, 1006, 1005, 1002]