I have been playing with PETL and seeing if I could extract multiple xml files and combine them into one.
I have no control over the structure of the XML files, Here are the variations I am seeing and which is giving my trouble.
XML File 1 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name>John Doe</Name>
<Date>01/01/2021</Date>
</Info>
<App>
<Description></Description>
<Type>Two</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
XML File 2 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name></Name>
<Date>01/02/2021</Date>
</Info>
<App>
<Description>Sample description here.</Description>
<Type>One</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
<DetailOne>3</DetailOne>
<DetailTwo>4</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
My python code is just scanning the subfolder xmlfiles and then trying to use PETL to parse from there. With the structure of the documents, I am loading three tables so far:
1 to hold the Info name and date 2 to hold the description and type 3 to collect the details
import petl as etl
import os
from lxml import etree
for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
if filename.endswith('.xml'):
# Get the info children
table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
'Name': 'Name',
'Date': 'Date'
})
# Get the App children
table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
'Description': 'Description',
'Type': 'Type'
})
# Get the App Details children
table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
'DetailOne': 'DetailOne',
'DetailTwo': 'DetailTwo'
})
# concat
c = etl.crossjoin(table1, table2, table3)
# I want the filename added on
result = etl.addfield(c, 'FileName', filename)
print('Results:\n', result)
I concat the three tables because I want the Info and App data on each line with each detail. This works until I get a XML file that has multiples of the DetailOne and DetailTwo elements.
What I am getting as results is:
Results:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None | Two | 1 | 2 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None | Two | 10 | 11 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
Results:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | ('1', '3') | ('2', '4') | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
The second file showing DetailOne being ('1','3') and DetailTwo being ('2', '4') is not what I want.
What I want is:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | 1 | 2 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 3 | 4 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
I believe XPath may be the way to go but after researching:
https://petl.readthedocs.io/en/stable/io.html#xml-files - doesn't go in depth on lxml and petl
some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp
some more reading here: https://lxml.de/tutorial.html
Any assistance on this is appreciated!