0

This below is my xml data

<?xml version="1.0" ?>
<ALLRECORDS>
<RECORD>
<RecordType>Forward</RecordType>
<Env SINGLE="Y" TYPE="EntList">
<ENV>
<InputDate>20201118</InputDate>
<Company>CompanyName1</Company>
<Book>Book1</Book>
<Status>DONE</Status>
</ENV>
</Env>
<Back SINGLE="Y" TYPE="EntList">
<BACK>
<PendingUser>User1</PendingUser>
<FairValueLevel>LEVEL2</FairValueLevel>
</BACK>
</Back>
<Financial SINGLE="Y" TYPE="EntList">
<FINANCIAL>
<PorS>S</PorS>
<UserSelectedFlag>SELL</UserSelectedFlag>
<BoughtAmt TYPE="Numeric">860000000</BoughtAmt>
<Valdate>20220513</Valdate>
<Rate TYPE="Numeric">103.0575</Rate>
</FINANCIAL>
</Financial>
</RECORD>
<RECORD>
<RecordType>Forward</RecordType>
<Env SINGLE="Y" TYPE="EntList">
<ENV>
<InputDate>20201119</InputDate>
<Company>CompanyName2</Company>
<Book>Book2</Book>
<Status>DONE</Status>
</ENV>
</Env>
<Back SINGLE="Y" TYPE="EntList">
<BACK>
<PendingUser>User2</PendingUser>
<FairValueLevel>LEVEL3</FairValueLevel>
</BACK>
</Back>
<Financial SINGLE="Y" TYPE="EntList">
<FINANCIAL>
<PorS>S</PorS>
<UserSelectedFlag>SELL</UserSelectedFlag>
<BoughtAmt TYPE="Numeric">860000000</BoughtAmt>
<Valdate>20220512</Valdate>
<Rate TYPE="Numeric">103.09</Rate>
</FINANCIAL>
</Financial>
</RECORD>
</ALLRECORDS>

So, basically, this has a tag ALLRECORDS. Each Record is represented by RECORD. Where it gets interesting is, each RECORD has three sub-tags: Env, Back and Financial.

Question is: How do I create one row of data per RECORD (without worrying about column names, since tags under FINANCIAL would change).

Here is what I have done so far:

import xml.dom.minidom
import xml.etree.ElementTree as ET
import pandas as pd
from os import listdir
from os.path import isfile, join
import datetime


def get_row(record):

    envrecords = record.find('Env')
    env = envrecords.find('ENV')
    InputDate = env.find('InputDate').text
    Company = env.find('Company').text
    Book = env.find('Book').text
    Status = env.find('Status').text

    backrecords = record.find('Back')
    back = backrecords.find('BACK')
    PendingUser = back.find('PendingUser').text
    FairValueLevel = '' if back.find('FairValueLevel') is None else back.find('FairValueLevel').text

    finrecords = record.find('Financial')
    fin = finrecords.find('FINANCIAL')
    PorS = fin.find('PorS').text
    UserSelectedFlag = fin.find('UserSelectedFlag').text
    BoughtAmt = fin.find('BoughtAmt').text
    Valdate = fin.find('Valdate').text
    Rate = fin.find('Rate').text

    row = [InputDate, Company, Book, Status, PendingUser, FairValueLevel,
           PorS, UserSelectedFlag, BoughtAmt, Valdate, Rate]

    return row

def get_xml_data(xmlpath):

    cols = ['InputDate', 'Company', 'Book', 'Status', 'PendingUser', 'FairValueLevel',
            'PorS', 'UserSelectedFlag', 'BoughtAmt', 'Valdate', 'Rate']

    rows = []
    allfiles = [f for f in listdir(xmlpath) if isfile(join(xmlpath, f)) and f.endswith('xml')]
    for filename in allfiles:
        fullfilename = join(xmlpath, filename)
        root = ET.parse(fullfilename).getroot()
        tradetypes = ['RECORD']
        for tt in tradetypes:
            for rec in root.findall(tt):
                eachrow = get_row(rec)
                eachrowdict = dict(zip(cols,eachrow))
                rows.append(eachrowdict)

    return rows

filename = 'C:/Temp/FX/xmldata/stupidsample'
rows = get_xml_data(filename)
for a in rows:
    print(a)

As you can see, I have added an exception for tag "FairValueLevel". Now, in "FINANCIAL" section, I have no idea what new column names would come in. How do I change my code say if under FINANCIAL I see a new tag called "BusinessDate"

Pankaj Singh
  • 526
  • 7
  • 21
  • *One row of data* ... in what data format? XML? JSON? CSV? Also, please post your current attempt. If you do not know how, please [research](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users) likely through the many, many Python XML solutions here on SO and elsewhere, then ask a specific question related to issues with your earnest code attempt. – Parfait Dec 02 '20 at 18:53
  • @Parfait - Hey, I have updates my post per your recommendation. I am basically looking for my method get_xml_data to return rows of dict. I did try looking for couldn't find any solutions. Maybe my search query wasn't the best – Pankaj Singh Dec 04 '20 at 17:00
  • What exactly is your expected output? – Jack Fleeting Dec 04 '20 at 21:50

1 Answers1

0

Consider a dynamic solution without hard-coding nodes should they change. Below merges dictionaries using latest method available in Python 3.5+:

import xml.etree.ElementTree as ET
import pandas as pd

doc = ET.parse("FinancialXMLSingleRow.xml")

df = pd.DataFrame([ {**{i.tag:i.text.strip() for i in rec.findall("Env/ENV/*")},
                     **{i.tag:i.text.strip() for i in rec.findall("Back/BACK/*")},
                     **{i.tag:i.text.strip() for i in rec.findall("Financial/FINANCIAL/*")} }
                           for rec in doc.findall(".//RECORD")])
                    
print(df)
#   InputDate       Company   Book Status PendingUser FairValueLevel PorS UserSelectedFlag  BoughtAmt   Valdate      Rate
# 0  20201118  CompanyName1  Book1   DONE       User1         LEVEL2    S             SELL  860000000  20220513  103.0575
# 1  20201119  CompanyName2  Book2   DONE       User2         LEVEL3    S             SELL  860000000  20220512    103.09
Parfait
  • 104,375
  • 17
  • 94
  • 125