0

I have been playing with PETL and seeing if I could extract multiple xml files and combine them into one.

I have no control over the structure of the XML files, Here are the variations I am seeing and which is giving my trouble.

XML File 1 Example:

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name>John Doe</Name>
            <Date>01/01/2021</Date>
        </Info>
        <App>
            <Description></Description>
            <Type>Two</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

XML File 2 Example:

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name></Name>
            <Date>01/02/2021</Date>
        </Info>
        <App>
            <Description>Sample description here.</Description>
            <Type>One</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
                <DetailOne>3</DetailOne>
                <DetailTwo>4</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

My python code is just scanning the subfolder xmlfiles and then trying to use PETL to parse from there. With the structure of the documents, I am loading three tables so far:

1 to hold the Info name and date 2 to hold the description and type 3 to collect the details

import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })

        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)
                

I concat the three tables because I want the Info and App data on each line with each detail. This works until I get a XML file that has multiples of the DetailOne and DetailTwo elements.

What I am getting as results is:

Results:

 +------------+----------+-------------+------+-----------+-----------+----------+
| Date       | Name     | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None        | Two  | 1         | 2         | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None        | Two  | 10        | 11        | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+

Results:

 +------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | ('1', '3') | ('2', '4') | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

The second file showing DetailOne being ('1','3') and DetailTwo being ('2', '4') is not what I want.

What I want is:

+------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | 1          | 2          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 3          | 4          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

I believe XPath may be the way to go but after researching:

https://petl.readthedocs.io/en/stable/io.html#xml-files - doesn't go in depth on lxml and petl

some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp

some more reading here: https://lxml.de/tutorial.html

Any assistance on this is appreciated!

  • Personally, I would choose XSLT for this task. See https://stackoverflow.com/questions/37694540/merge-multiple-xml-files-to-one-single-xml-file-using-xslt for an example. – kimbert Nov 02 '21 at 22:26
  • I forgot to mention in my post that the end goal was to upload the data to a database. What attracted me to PETL is that I have multiple file formats I'm trying to compile into one dataset. I'll take a look into XSLT and see if that might be a solution for me. I was hoping to see how someone might be able to accomplish this with PETL though. – variable89 Nov 02 '21 at 23:39

1 Answers1

0

First, thanks for taking the time to write a good question. I'm happy to spend the time answering it.

I've never used PETL, but I did scan the docs for XML processing. I think your main problem is that the <Details> tag sometimes contains 1 pair of tags, and sometimes multiple pairs. If only there was a way to extract a flat list of the and tag values, without the enclosing tags getting in the way...

Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your example XML.

So I suspect that something like this should work:

import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })

        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

The leading // may be redundant. It is XPath syntax for 'at any level in the document'. I don't know how PETL processes the XPath so I'm trying to play safe. I agree btw - the documentation is rather light on details.

kimbert
  • 2,376
  • 1
  • 10
  • 20
  • I appreciate you taking the time to write this out. This had helped me get into the direction I needed. After doing much more extensive research, I found a solution that worked for me but in the end realized that I was using PETL wrong. – variable89 Dec 10 '21 at 21:00