1

The code below takes a directory of XMLs files and parses them into a CSV fie. This was possible only for a user in this community. I have learned so much.

from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
from pathlib import Path

directory = 'C:/Users/docs/FolderwithXMLs'

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']

    writer.writerow(headers)

    xml_files_list = list(map(str,Path(directory).glob('**/*.xml')))
    for xml_file in xml_files_list:
        tree = ET.parse(xml_file)
        root = tree.getroot()

        start_nodes = root.findall('.//START')
        for sn in start_nodes:
            row = defaultdict(str)


            for k,v in sn.attrib.items():
                row[k] = v

            for rn in sn.findall('.//Rational'):
                row['rational'] = rn.text

            for qu in sn.findall('.//Qualify'):
                row['qualify'] = qu.text

            for ds in sn.findall('.//Description'):
                row['description_txt'] = ds.text
                row['description_num'] = ds.attrib['num']


            for st in sn.findall('.//SetData'):
                for k,v in st.attrib.items():
                    row['set_data_'+ str(k)] = v
                row_data = [row[i] for i in headers]
                writer.writerow(row_data)
                row = defaultdict(str)

The xml files on the other hand have a format likes this

<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
    <START id="ID0001" service_code="0x5196">
      <Docs Docs_type="START">
        <Rational>225196</Rational>
        <Qualify>6251960000A0DE</Qualify>
      </Docs>
      <Description num="1213f2312">The parameter</Description>
      <DataFile dg="12" dg_id="let">
        <SetData value="32" />
      </DataFile>
    </START>
    <START id="DG0003" service_code="0x517B">
      <Docs Docs_type="START">
        <Rational>23423</Rational>
        <Qualify>342342</Qualify>
      </Docs>
      <Description num="3423423f3423">The third</Description>
      <DataFile dg="55" dg_id="big">
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="02" />
      </DataFile>
    </START>
    <START id="ID0048" service_code="0x5198">
      <RawData rawdata_type="ASDS">
        <Rational>225198</Rational>
        <Qualify>343243324234234</Qualify>
      </RawData>
      <Description num="434234234">The forth</Description>
      <DataFile unit="21" unit_id="FEDS">
        <FileX unit="eg" discrete="false" axis_pts="19" name="Vsome" text_id="bx5" unit_id="GDFSD" />
        <SetData xin="5" xax="233" value="323" />
        <SetData xin="123" xax="77" value="555" />
        <SetData xin="17" xax="65" value="23" />
      </DataFile>
    </START>
</FINAL>
</ProjectData>

The results look like the picture below. enter image description here

Recently I have been trying to modify the code, so that the results look similar to the picture bellow. Let’s take id=”ID0048”, the code parses id, service_code only once but it if there are multiple lines of SetData, it will create a new line but it wont repeat the id, service_code and the others. Struggling to achieve something like the picture below

enter image description here

2 Answers2

0

This code should do:

from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
from pathlib import Path

directory = '.'

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']

    writer.writerow(headers)

    xml_files_list = list(map(str, Path(directory).glob('**/*.xml')))
    print(xml_files_list)
    for xml_file in xml_files_list:
        tree = ET.parse(xml_file)
        root = tree.getroot()

        start_nodes = root.findall('.//START')
        for sn in start_nodes:
            row = defaultdict(str)

            repeated_values = dict()
            for k,v in sn.attrib.items():
                repeated_values[k] = v

            for rn in sn.findall('.//Rational'):
                repeated_values['rational'] = rn.text

            for qu in sn.findall('.//Qualify'):
                repeated_values['qualify'] = qu.text

            for ds in sn.findall('.//Description'):
                repeated_values['description_txt'] = ds.text
                repeated_values['description_num'] = ds.attrib['num']


            for st in sn.findall('.//SetData'):
                for k,v in st.attrib.items():
                    row['set_data_'+ str(k)] = v
                for key in repeated_values.keys():
                    row[key] = repeated_values[key]
                row_data = [row[i] for i in headers]
                writer.writerow(row_data)
                row = defaultdict(str)

Explanation

The problem is that all nodes of the XML document in the code from the question are getting written to CSV only once. Provided that we have SetData the most repeatable node we can use this to identify the number of times we have to write other data.

This approach would not work if this assumption is incorrect.

Alexander Pushkarev
  • 1,075
  • 6
  • 19
  • can you please provide a screenshot of the result because using the same xml file as above and your code I get only the headers and nothing else –  Jan 06 '20 at 18:32
  • @ebe I can post an output.csv content tomorrow. Works just fine for me. What version of Python do you use? Please also make sure that you specify correct path - in my example, it would look for XML files in the current directory, which not neceserrily what you want. – Alexander Pushkarev Jan 06 '20 at 22:35
  • If you look at the xml file you can see Xaxis. I have been trying to add it but some of its elements are giving me an KeyError. You can see which ones are giving me an error. https://pasteboard.co/IP0gCnK.png –  Jan 08 '20 at 10:57
  • can you please have a look at the comment above. Been stuck at this problem for hours now. Would appreciate it –  Jan 08 '20 at 12:33
  • @ebe I don't really know what might be the reason. I'd suggest having a look at what kind attributes ds.attrib dict have (have a look how it can be done here https://stackoverflow.com/questions/5904969/how-to-print-a-dictionarys-key), maybe it will explain something – Alexander Pushkarev Jan 08 '20 at 12:59
  • the problem is that the first time discrete is showing is in line 400 of this xml files so python is assuming there is no key named this way. When I just ran the code over a small portion of the xml, then I'm not getting any key error. –  Jan 08 '20 at 13:06
  • the problem is appearing because in some segments of the xml these elements are not there. Can you change the code so the column stays blank if example **DataFile dg** is not present in that segment of the xml –  Jan 08 '20 at 14:51
  • 1
    @ebe I would suggest posting this as a separate question as if I am not mistaken it is a different problem already. – Alexander Pushkarev Jan 08 '20 at 15:18
0

Consider the special purpose language, XSLT, using Python's third-party module, lxml, to directly transform XML to CSV output. Specifically, have XSLT pull from the lower level, SetData and retrieve upper level information with ancestor.

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="delim">,</xsl:variable>
  <xsl:template match="/ProjectData">
      <!------------------------------- HEADERS ------------------------------->
      <xsl:text>id,service_code,rational,qualify,description_num,description,</xsl:text>
      <xsl:text>data_file_dg,data_file_dg_id,data_file_unit,data_file_unit_id,</xsl:text>
      <xsl:text>set_data_x,set_data_xin,set_data_xat,set_data_value&#xa;</xsl:text>
      <!-----------------------------------------------------------------------> 
      <xsl:apply-templates select="descendant::SetData"/>
  </xsl:template>

  <xsl:template match="SetData">
      <xsl:value-of select="concat(ancestor::START/@id, $delim,
                                   ancestor::START/@service_code, $delim,
                                   ancestor::START/*[1]/Rational, $delim,
                                   ancestor::START/*[1]/Qualify, $delim,
                                   ancestor::START/Description/@num, $delim,
                                   ancestor::START/Description, $delim,
                                   ancestor::START/DataFile/@dg, $delim,
                                   ancestor::START/DataFile/@dg_id, $delim,
                                   ancestor::START/DataFile/@unit, $delim,
                                   ancestor::START/DataFile/@unit_id, $delim,
                                   @x, $delim,
                                   @xin, $delim,
                                   @xat, $delim,
                                   @value)"/>
      <xsl:text>&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

Python (no for loops or if/else logic)

import lxml.etree as et

# LOAD XML AND XSL FILES
xml = et.parse('Input.xml')
xsl = et.parse('Script.xsl')

# INITIALIZE TRANSFORMER
transform = et.XSLT(xsl)

# TRANSFORM INPUT
result = transform(xml)

print(str(result))
# id,service_code,rational,qualify,description_num,description,data_file_dg,data_file_dg_id,data_file_unit,data_file_unit_id,set_data_x,set_data_xin,set_data_xat,set_data_value
# ID0001,0x5196,225196,6251960000A0DE,1213f2312,The parameter,12,let,,,,,,32
# DG0003,0x517B,23423,342342,3423423f3423,The third,55,big,,,E1,,,21259
# DG0003,0x517B,23423,342342,3423423f3423,The third,55,big,,,E2,,,02
# ID0048,0x5198,225198,343243324234234,434234234,The forth,,,21,FEDS,,5,,323
# ID0048,0x5198,225198,343243324234234,434234234,The forth,,,21,FEDS,,123,,555
# ID0048,0x5198,225198,343243324234234,434234234,The forth,,,21,FEDS,,17,,23

# SAVE XML TO CSV
with open('Output.csv', 'wb') as f:
    f.write(str(result))

Online Demo

CSV Output


To loop across a folder of XML files, simply integrate above in a loop. Here wraps all XML processing into a single method to build a list of results via list comprehension and finally written to CSV iteratively. NOTE: For one set of headers, place headers only in CSV and remove from XSLT as indicated above.

import lxml.etree as et
from pathlib import Path

# LOAD XSL SCRIPT
xsl = et.parse('Script.xsl')   # LOAD XML FILE ONCE (REMOVE HEADERS)

def proc_xml(xml_file):     
    xml = et.parse(xml_file)   # LOAD XML FILE  
    transform = et.XSLT(xsl)   # INITIALIZE TRANSFORMER
    result = transform(xml)    # TRANSFORM INPUT    
    return str(result)

xml_files_list = list(map(str,Path(directory).glob('**/*.xml')))
results = [proc_xml(x) for x in xml_files_list]

with open('Output.csv', 'w', newline='') as f:
    f.write('id,service_code,rational,qualify,description_num,description,'
            'data_file_dg,data_file_dg_id,data_file_unit,data_file_unit_id,'
            'set_data_x,set_data_xin,set_data_xat,set_data_value\n')

    # SAVE XML TO CSV
    for r in results:
        f.write(r)
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • i would really appreciate if you could edit the code so it take a directory where a lot of xml files are and all of these are parsed into one csv –  Jan 07 '20 at 12:47
  • I did do this. See extended answer at bottom. – Parfait Jan 07 '20 at 13:55