-2

I am working on parsing XML variables to pandas dataframe. The XML files looks like ( This XML file has been simplified for demo)

<Instrm>
    <Rcrd>
        <FinPpt>
            <Id>BT0007YSAWK</Id>
            <FullNm>Turbo Car</FullNm>
            <Ccy>EUR</Ccy>
            <Cmmdty>false</Cmmdty>
        </FinPpt>
        <Issr>529900M2F7D5795H1A49</Issr>
        <Attrbts>
            <Authrty>US</Authrty>
            <Prd>
                <Dt>2002-03-20</Dt>
            </Prd>
            <Ven>NYSE</Ven>
        </Attrbts >
    </Rcrd>
</Instrm>
<Instrm>
    <Rcrd>
        <FinPpt>
            <Id>BX0009YNOYK</Id>
            <FullNm>Turbo truk</FullNm>
            <Ccy>EUR</Ccy>
            <Cmmdty>false</Cmmdty>
        </FinPpt>
        <Issr>58888M2F7D579536J4</Issr>
        <Attrbts>
            <Authrty>UK</Authrty>
            <Prd>
                <Dt>2002-04-21</Dt>
            </Prd>
            <Ven>BOX</Ven>
        </Attrbts >
    </Rcrd>
</Instrm>
...

I attempted to parse this XML file to a dataframe with attributes to be the column names, like this:

     Id          FullNm     Ccy   Cmmdty           Issr              Authrty      Dt         Ven   
BT0007YSAWK     Turbo Car   EUR   false    529900M2F7D5795H1A49       US       2002-03-20    NYSE
BX0009YNOYK     Turbo truk  EUR   false    58888M2F7D579536J4         UK       2002-04-21    BOX
 .....            ......    

but still don't know how after I reviewed some post. All I can do is to extract ID in a list, like

import xml.etree.ElementTree as ET
import pandas as pd
import sys

tree = ET.parse('sample.xml')
root = tree.getroot()
report = root[1][0][0]
records = report.findall('Instrm')
ids = []
for r in records:
    ids.append(r[0][0][0].text)

print(ids[0:100])

out:

[BT0007YSAWK, BX0009YNOYK, …….]

I don't quite understand how to utilize 'nodes' here. Can someone help? Thank you.

FlyUFalcon
  • 314
  • 1
  • 4
  • 18

2 Answers2

0

Assuming a <root> node in posted XML without namespaces, consider building a dictionary via list/dict comprehension and combining sub dictionaries (available in Python 3.5+) that parse to needed nodes. Then call the DataFrame() constructor on returned list of dictionaries.

data = [{**{el.tag:el.text.strip() for el in r.findall('FinPpt/*')},
         **{el.tag:el.text.strip() for el in r.findall('Issr')},
         **{el.tag:el.text.strip() for el in r.findall('Attrbts/*')},
         **{el.tag:el.text.strip() for el in r.findall('Attrbts/Prd/*')} 
        } for r in root.findall('Instrm/Rcrd')]
    
df = pd.DataFrame(data)
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

To get your target data without converting use an xml parser (like lxml) and xpath.

Something along these lines: [note that you have to wrap you xml with a root element]

string = """
<doc>
[your xml above]
</doc>
"""
from lxml import etree
doc = etree.XML(string)
insts = doc.xpath('//Instrm')
for inst in insts:
    f_nams = inst.xpath('//FullNm')
    ccys = inst.xpath('//Ccy')
    cmds = inst.xpath('//Cmmdty')
    issuers = inst.xpath('//Issr')
for a,b,c,d in zip(f_nams,ccys,cmds,issuers):
    print(a.text,b.text,c.text,d.text)

Output:

Turbo Car EUR false 529900M2F7D5795H1A49
Turbo truk EUR false 58888M2F7D579536J4
Jack Fleeting
  • 24,385
  • 6
  • 23
  • 45