Lets assume we have an arbitrary XML document like below
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://something.org/schema/s/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
<orgUnitId>Organization 1</orgUnitId>
<requiredLevel>academic bachelor</requiredLevel>
<requiredLevel>academic master</requiredLevel>
<programDescriptionText xml:lang="nl">Here is some text; blablabla</programDescriptionText>
<searchword xml:lang="nl">Scrum master</searchword>
</program>
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
<requiredLevel>bachelor</requiredLevel>
<requiredLevel>academic master</requiredLevel>
<requiredLevel>academic bachelor</requiredLevel>
<orgUnitId>Organization 2</orgUnitId>
<programDescriptionText xml:lang="nl">Text from another organization about some stuff.</programDescriptionText>
<searchword xml:lang="nl">Excutives</searchword>
</program>
<program xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<orgUnitId>Organization 3</orgUnitId>
<programDescriptionText xml:lang="nl">Also another huge text description from another organization.</programDescriptionText>
<searchword xml:lang="nl">Negotiating</searchword>
<searchword xml:lang="nl">Effective leadership</searchword>
<searchword xml:lang="nl">negotiating techniques</searchword>
<searchword xml:lang="nl">leadership</searchword>
<searchword xml:lang="nl">strategic planning</searchword>
</program>
</programs>
Currently I'm looping
over the elements I need by using their absolute paths, since I'm not able to use any of the get
or find
methods in ElementTree. As such, my code looks like below:
import pandas as pd
import xml.etree.ElementTree as ET
import numpy as np
import itertools
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
dfcols=['organization','description','level','keyword']
organization=[]
description=[]
level=[]
keyword=[]
for node in root:
for child in
node.findall('.//{http://something.org/schema/s/program}orgUnitId'):
organization.append(child.text)
for child in node.findall('.//{http://something.org/schema/s/program}programDescriptionText'):
description.append(child.text)
for child in node.findall('.//{http://something.org/schema/s/program}requiredLevel'):
level.append(child.text)
for child in node.findall('.//{http://something.org/schema/s/program}searchword'):
keyword.append(child.text)
The goal, of course, is to create one dataframe. However, since each node in the XML file contains one or multiple elements, such as requiredLevel
or searchword
I'm currently losing data when I'm casting it to a dataframe by either:
df=pd.DataFrame(list(itertools.zip_longest(organization,
description,level,searchword,
fillvalue=np.nan)),columns=dfcols)
or using pd.Series
as given here or another solution which I don't seem to get it fit from here
My best bet is not to use Lists at all, since they don't seem to index the data correctly. That is, I lose data from the 2nd to Xth child node. But right now I'm stuck, and don't see any other options.
What my end result should look like is this:
organization description level keyword
Organization 1 .... academic bachelor, Scrum master
academic master
Organization 2 .... bachelor, Executives
academic master,
academic bachelor
Organization 3 .... Negotiating,
Effective leadership,
negotiating techniques,
....