4

I am new in here and I have been trying to create a small python script to convert xml to csv. Based on my reading various post here in Stackoverflow I have managed to come up with a sample code that works just fine.. However the data I am trying to work with has multiple layers and thus I am unsure how to extract the data at the leaf level.

Given below is how the data looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Transmission>
    <TransmissionBody>
        <level1>
            <level2>
                <level3>
                    <level4>
                        <level5>
                            <level6>
                                <ColA>ABC</ColA>
                                <ColB>123</ColB>
                            </level6>
                        </level5>
                    </level4>
                </level3>
            </level2>
        </level1>
    </TransmissionBody>
</Transmission>

I am trying to use the below code to try converting the xml to csv

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
final = {}
for elem in root:
    if len(elem):
        for c in elem.getchildren():
            final[c.tag] = c.text
    else:
        final[elem.tag] = elem.text

df = pd.DataFrame([final])
df.to_csv('file.csv)

This code however just pulls level2 and not ColA from level6.

Expected Output:

Transmission,TransmissionBody,level1,level2,level3,level4,level5,level6,ColA,ColB
,,,,,,,,ABC,123
,,,,,,,,DEF,456

Updated code:

allFiles = glob.glob(folder)
for file in allFiles:
    xmllist = [file]
    for xmlfile in xmllist:
        tree = ET.parse(xmlfile)
        root = tree.getroot()

        def f(elem, result):
            result[elem.tag] = elem.text
            cs = elem.getchildren()
            for c in cs:
                result = f(c, result)
            return result

         d = f(root, {})
         df = pd.DataFrame(d, index=['values'])

2 Answers2

2

you have an indentation issue on

if len(elem):

I guess this should resolve it.

fastali
  • 29
  • 7
  • `if len(elem):` is on the baseline, what kind of indentation do you suggest on that line? – vahdet Mar 06 '19 at 08:44
  • What makes you think the if () line is on the baseline? To me it looks to be part of the for elem in root: loop, so it should be indented. – rrd Mar 06 '19 at 08:52
1

If I understood your question correctly, you need to traverse the XML tree, so you probably want to have a recursive function that does that. Something like the following:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
    return result

d = f(root, {})
df = pd.DataFrame(d, index=['values']).T
df

Out:

    values
Transmission    \n
TransmissionBody    \n
level1  \n
level2  \n
level3  \n
level4  \n
level5  \n
level6  \n
ColA    ABC
ColB    123

Update: Here's when we need to do it on multiple XML files. I've added another file similar to the original one with ColA, ColB rows replaced with

<ColA>DEF</ColA>
<ColB>456</ColD>

Here's the code:

def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
    return result

result = {}
for file in glob.glob('*.xml'):
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)

df = pd.DataFrame(result, index=['values']).T
df

And the output:

                    0    1
Transmission       \n   \n
TransmissionBody   \n   \n
level1             \n   \n
level2             \n   \n
level3             \n   \n
level4             \n   \n
level5             \n   \n
level6             \n   \n
ColA              ABC  DEF
ColB              123  456
perl
  • 9,826
  • 1
  • 10
  • 22
  • A doubt: Why are you transposing it? The requirement was to create a csv with headers on the top row, unlike headers on the first column, isn't? – Aakash Basu Jun 19 '19 at 10:28