
Links:

Sample XML

Tools in use:

  • Power Automate Desktop (PAD)
  • Command Prompt (CMD) Session operated by PAD
  • Python 3.10 (importing NumPy and Pandas, and using ElementTree)
  • SQL to read data and insert into primary database

Method:

  • XML to CSV conversion

  • I'm using Power Automate Desktop (PAD) to automate all of this because it is what I know.

  • The conversion itself runs in Python inside a CMD session, importing numpy and pandas and using ElementTree.

Goal(s):

  1. I would like to keep the namespace out of the CSV headers so SQL can interact with the CSV data

  2. I would like to append a master csv file with the new data coming in from every call

  3. This last goal is more of a wish-list item and I will likely take it over to a PAD forum. I'm just including it in the rare event someone has a solution:

  • I'd like to avoid using CMD altogether, but given the limitations of PAD, I have to use a CMD session to run regular Python.
  • PAD does have a Python module, but it uses IronPython 2.7, so I cannot access the libraries above and I don't know how to write the Python code below in IronPython. If I could get this sorted out, the entire process would be much more efficient. (Again, this is a nice-to-have; the other goals are the priority.)

Issues:

  • xmlns is being added to some of the headers when writing the converted data to CSV
  • The Python code generates a new .csv file on each loop; I would like it to simply append the data to a common .csv file to build the dataset.

Details (Probably Unnecessary):

First, I am not a Python expert and I am fairly novice with SQL as well.

Currently, the Python code (see below) converts a web-service call, with an XML-formatted body, to CSV format, then exports that data to an actual .csv file. This works great, but it overwrites the file each time, which means I need to program scripts to read this data before the file is deleted/replaced with the new file. I then use SQL to INSERT INTO a main dataset (append the data). It also prints the xmlns in some of the headers, which is something I need to avoid.

If you are wondering why I am performing the conversion and not simply parsing the XML data: my client requires CSV format for their datasets. Otherwise, I'd just parse out the XML data as needed. The other reason is that this data is updated incrementally at set intervals, building an audit trail, so a one-time bulk conversion is not possible.

  1. What I would like is for the Python code to perform the conversion and then append the result to a main dataset inside a single CSV file.

  2. The other issue here is that the xmlns from the XML is being pulled into some (not all) of the headers of the CSV table, which makes it a problem for SQL to read and insert into the main table. I cannot figure out a way around this (again, novice).

  3. Also, if anyone knows how one would write this in IronPython 2.7, that would be great too, because I'd be able to avoid the CMD session entirely.

So, if I could use Python to append a primary table with the converted data while escaping the namespace, this would solve all of my (current) issues and make the movement of this data far more efficient.
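To illustrate the target behavior (file names here are placeholders, and this is only a rough sketch of what I'm after, not my working code):

import csv
import os
from xml.etree import ElementTree

def strip_ns(tag):
    # Drop the '{namespace}' prefix ElementTree attaches to tag names
    return tag.split('}', 1)[1] if '}' in tag else tag

tree = ElementTree.parse('FILE_XML.xml')   # one web-service call
rows = []
for child in tree.getroot():
    row = {}
    for elem in child.iter():
        row[strip_ns(elem.tag)] = elem.text   # namespace-free headers
        row.update(elem.attrib)
    rows.append(row)

# Append every call to one master file; write the header only on first creation.
# Assumes each call yields the same tags, so the columns line up.
new_file = not os.path.exists('MASTER_TABLE.csv')
with open('MASTER_TABLE.csv', 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=sorted({k for r in rows for k in r}))
    if new_file:
        writer.writeheader()
    writer.writerows(rows)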

Also, due to the limited toolset I have, I am scripting this with Power Automate driving a CMD session.

Python Code (within CMD environment):

cd\WORK\XML

python

import numpy as np 
import pandas as pd
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()

# Collect every distinct (namespace-qualified) tag in the document
all_tags = list(set([elem.tag for elem in parentroot.iter()]))

rows = []

# Build one row per direct child of the root
for child in parentroot:
    temp_dict = {}
    for i in all_tags:
        tag_values = {}
        for inners in child.iter(i):
            temp_tag_value = {}
            temp_dict.update(inners.attrib)
            # Strip the '{namespace}' prefix from the tag name
            temp_tag_value[i.rsplit("}", 1)[1]] = inners.text
            tag_values.update(temp_tag_value)
        temp_dict.update(tag_values)
    rows.append(temp_dict)

dataframe = pd.DataFrame.from_dict(rows, orient='columns')
dataframe = dataframe.replace({np.nan: None})
# Overwrites the CSV on every run -- this is the behavior I want to change
dataframe.to_csv('FILE_TABLE_CMD.csv', index=False)
Doug
  • Can you post a sample of the XML in the body of the post? We don't need all the XML, just enough to see the repetitive pattern and where namespaces are defined. – Parfait Oct 15 '22 at 00:45
  • Also, your parsing does not directly parse any specific nodes. Given the nested nature of XML, what are the fields of the CSV, which may be very wide? – Parfait Oct 15 '22 at 00:52
  • The sample XML is linked at the top. Also, I don't want to parse the XML; I want to convert it and append multiple calls. Basically, the linked XML is a single call, and I want to convert it to CSV, make another call, convert it to CSV, and append the results over and over again. – Doug Oct 16 '22 at 19:50
  • Linked XML is *very* large. We ask for [mcve] in body of post and not links that may go dead for future readers. – Parfait Oct 16 '22 at 20:58
  • It is a whole conversion, so if you are calling it parsing, I guess my answer would be 'everything': all elements' and attributes' names and values. – Doug Oct 16 '22 at 21:10

2 Answers


Given no analysis is needed, avoid pandas and consider building the CSV with csv.DictWriter, where you can specify append mode in the write context. The code below parses all underlying descendants of <ClinicalData> and migrates each set into a CSV row.

import os
from csv import DictWriter
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
nmsp = {"doc": "http://www.cdisc.org/ns/odm/v1.3"}

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))

# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]
# UNNEST AND DE-DUPE
all_keys = set(sum([key for key in all_keys], []))

# COMBINE ELEM AND ATTRIB NAMES; SORT SO THE COLUMN ORDER IS STABLE ACROSS RUNS
all_tags = all_tags + list(all_keys)
all_tags = sorted({(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags})

# APPEND TO EXISTING DATA WITH 'a'; newline='' PREVENTS BLANK ROWS ON WINDOWS
new_file = not os.path.exists('FILE_TABLE_CMD.csv')
with open('FILE_TABLE_CMD.csv', 'a', newline='') as f:
    writer = DictWriter(f, fieldnames = all_tags)
    # WRITE HEADER ONLY WHEN THE FILE IS FIRST CREATED
    if new_file:
        writer.writeheader()

    # ITERATE THROUGH ALL ClinicalData ELEMENTS
    for cd in parentroot.findall('doc:ClinicalData', namespaces=nmsp):
        temp_dict = {}

        # ITERATE THROUGH ALL DESCENDANTS
        for elem in cd.iter():
            # UPDATE DICT FOR ELEMENT TAG/TEXT
            temp_dict[elem.tag.split("}", 1)[1]] = elem.text
            # MERGE ELEM DICT WITH ATTRIB DICT
            temp_dict = {**temp_dict, **elem.attrib}

            # REMOVE NAMESPACES IN KEYS
            temp_dict = {
                (k.split('}')[1] if '}' in k else k):v 
                for k,v
                in temp_dict.items()
             }

        # WRITE ROW TO CSV
        writer.writerow(temp_dict)
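Alternatively, as suggested in the comments below, if the CSV already exists from a previous run you can reuse its header order so appended rows line up with earlier ones. A small sketch (run this before constructing the DictWriter):

from csv import DictReader

if os.path.exists('FILE_TABLE_CMD.csv'):
    with open('FILE_TABLE_CMD.csv', newline='') as f:
        existing = DictReader(f).fieldnames
    if existing:
        # Reuse the column order already written to the file
        all_tags = existing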

Actually, you can use the new iterparse feature of pandas.read_xml in the latest v1.5. Though intended for very large XML, this feature allows parsing any underlying element or attribute without the restriction of the relationships required by XPath.

You will still need to find all element and attribute names. The CSV output will differ from the above, as this method removes any all-empty columns and retains the order of elements/attributes as presented in the XML. Also, pandas.DataFrame.to_csv does support append mode, but conditional logic is needed for writing headers.

import os
from xml.etree import ElementTree
import pandas as pd  # VERSION 1.5+

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))

# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]
# UNNEST AND DE-DUPE
all_keys = set(sum([key for key in all_keys], []))

# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]

clinical_data_df = pd.read_xml(
    "FILE_XML.xml", iterparse = {"ClinicalData": all_tags}, parser="etree"
)

if os.path.exists("FILE_TABLE_CMD.csv"):
    # APPEND TO CSV WITHOUT HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False, mode="a", header=False)
else:
    # CREATE CSV WITH HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False)
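The same conditional can be folded into a single call, since header is evaluated before the write and mode="a" creates the file if it is missing:

clinical_data_df.to_csv(
    "FILE_TABLE_CMD.csv",
    index=False,
    mode="a",
    header=not os.path.exists("FILE_TABLE_CMD.csv"),
)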
Parfait
  • I was able to use Parfait's code in example #1, using DictWriter. The code runs fine, but the output prints: Headers, Data, Blank Row, Data, Blank Row... and on append it writes the headers again, all out of order, meaning the appended data is also out of order. So, 2 issues, but it looks like it wants to work: 1. Empty rows (no data loss though) 2. Header/column random order & appending into the wrong columns. Any ideas? – Doug Oct 18 '22 at 20:39
  • I also tried your second suggestion, using the get-tags code above it. It fails with the error 'parentroot' is not defined. I appreciate all of the time you took to answer my initial question. If you have time, please let me know what you think about these issues with each code set. Thanks, Parfait. – Doug Oct 18 '22 at 20:57
  • See [edit](https://stackoverflow.com/posts/74091146/revisions) for how to run the second solution. I cannot reproduce the blank rows with the XML you posted. Regarding appending and headers out of alignment, the order of columns did not occur to me. You may need to read the CSV with `DictReader` to retrieve the column names in proper order via [csvreader.fieldnames](https://docs.python.org/3/library/csv.html#csv.csvreader.fieldnames) and use that list of columns in `DictWriter`. – Parfait Oct 18 '22 at 20:59
  • I see your issue with blank rows which tend to occur on Windows machines. See [CSV file written with Python has blank lines between each row](https://stackoverflow.com/q/3348460/1422451) to use `newline=''`. – Parfait Oct 18 '22 at 21:18
  • Ahh, thanks for everything Parfait :) – Doug Oct 19 '22 at 03:20

I recommend splitting the large XML into branches and parsing those parts separately. This can be done in an object, which can also hold the data until it is written to a CSV file or a database like sqlite3, MySQL, etc. The object can also be called from a different thread.

I have not defined the CSV writing, because I don't know which data you would like to capture, but I think you can finish that easily; see the sketch after the code.

Here is my recommended concept:

import xml.etree.ElementTree as ET
import pandas as pd
#import re

class ClinicData:
    def __init__(self, branch):
        self.clinical_data = []
        self.tag_list = []
        for elem in branch.iter():
            self.tag_list.append(elem.tag)
            #print(elem.tag)
                          
    def parse_cd(self, branch):
        for elem in branch.iter():
            if elem.tag in self.tag_list:
                print(f"{elem.tag}--->{elem.attrib}")   
            if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}AuditRecord":
                AuditRec_val = pd.read_xml(ET.tostring(elem))
                print(AuditRec_val)    
        branch.clear()
     
    
def main():
    """Parse each clinic data into the class """
 
    xml_file = 'Doug_s.xml'
    #tree = ET.parse(xml_file)
    #root = tree.getroot()
    #ns = re.match(r'{.*}', root.tag).group(0)
    #print("Namespace:",ns)
      
    parser = ET.XMLPullParser(['end'])
    with open(xml_file, 'r', encoding='utf-8') as et_xml:
        for line in et_xml:
            parser.feed(line)
    
    for event, elem in parser.read_events():
        if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}ClinicalData" and event =='end':
            #print(event, elem.tag)
            elem_part = ClinicData(elem)
            elem_part.parse_cd(elem)
   
if __name__ == "__main__":
    main()
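A minimal sketch of the missing CSV step, assuming you collect the namespace-stripped tag/attribute pairs of each branch as dicts (e.g. appended to self.clinical_data inside parse_cd) and flush them afterwards; the helper name and file name are illustrative:

import csv
import os

def write_rows(rows, csv_file='clinical_data.csv'):
    """Append collected dicts to the CSV; write the header only on first creation."""
    if not rows:
        return
    fieldnames = sorted({key for row in rows for key in row})
    new_file = not os.path.exists(csv_file)
    with open(csv_file, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerows(rows)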
Hermann12