1

I am trying to create Pandas dataframe out of XML. The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<Products>
    <Info>
        <Msg>Shop items.</Msg>
    </Info>
    <shop shopNr="01">
        <ItemNr>1001</ItemNr>
        <ItemNr>1002</ItemNr>
        <ItemNr>1003</ItemNr>
        <ItemNr>1004</ItemNr>
                <ItemNr>1010</ItemNr>
    </shop>
    <shop shopNr="02">
        <ItemNr>1002</ItemNr>
        <ItemNr>1006</ItemNr>
        <ItemNr>1005</ItemNr>
    </shop>
    <shop shopNr="03">
        <ItemNr>1009</ItemNr>
        <ItemNr>1006</ItemNr>
        <ItemNr>1005</ItemNr>
        <ItemNr>1002</ItemNr>
    </shop>
</Products>

I have tried using XML Etree as the code below. I have two problems.

First, I can not get the values of ItemNr as children of the root. Instead of getting the value, ie. 1001, I get

<Element 'ItemNr' at 0x000001E2D6C41B38>.

The second problem is when I am creating a dataframe out of the lists. I end up having a list of lists in items. Although the result is empty now since I could not get the values above, I want to end up with a flattened list.

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()

shops = []
items = []
for node in root.iter('shop'):
    shops.append(node.attrib.get('shopNr'))
    items.append(list(node))

d = {'shops': shops, 'items': items}
df = pd.DataFrame(d)

The DataFrame produced.

 shops                 items
0    01  [[], [], [], [], []]
1    02          [[], [], []]
2    03      [[], [], [], []]

desired output is:


 shops                 items
0    01  [1001, 1002, 1003, 1004, 1010]
1    02          [1002, 1006, 1005]
2    03      [1009, 1006, 1005, 1002]

Ora Aff
  • 640
  • 1
  • 9
  • 24
MKJ
  • 499
  • 1
  • 7
  • 20
  • Possible duplicate of [How do I parse XML in Python?](https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) – Amit Gupta Apr 26 '19 at 06:18

2 Answers2

2

You want to append the text values from the ItemNr elements which are under the shop element to the items list and not the xml Element python object which is what you were doing.

The following code was working for me:

items.append([item_nr_element.text for item_nr_element in node.getchildren()])
cullzie
  • 2,705
  • 2
  • 16
  • 21
2

I hope this is the expected output:

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    items = []
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        items.append(anitem.text)
    shops_items = [shop_Nr,items]
    all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])        
print(df)

Output:

  SHOP_NUMBER                     ITEM_NUMBER
0          01  [1001, 1002, 1003, 1004, 1010]
1          02              [1002, 1006, 1005]
2          03        [1009, 1006, 1005, 1002]

If you want shops with individual items :

import xml.etree.ElementTree as ET
import pandas as pd
data = 'example_shops.xml'
tree = ET.parse(data)
root = tree.getroot()
shops_items = []
all_shops_items = []
for ashop in root.iter('shop'):
    shop_Nr = ashop.attrib.get('shopNr')
    for anitem in ashop.iter('ItemNr'):
        item_Nr = anitem.text
        shops_items = [shop_Nr,item_Nr]
        all_shops_items.append(shops_items)
df = pd.DataFrame(all_shops_items,columns=['SHOP_NUMBER','ITEM_NUMBER'])        
print(df)

output:

   SHOP_NUMBER ITEM_NUMBER
0           01        1001
1           01        1002
2           01        1003
3           01        1004
4           01        1010
5           02        1002
6           02        1006
7           02        1005
8           03        1009
9           03        1006
10          03        1005
11          03        1002
Ora Aff
  • 640
  • 1
  • 9
  • 24
  • Yes the first option is exactly the output. The second option also might help me later. Thanks – MKJ Apr 26 '19 at 18:52