Extract variables from XML to Pandas

Question

I am working on parsing XML variables to pandas dataframe. The XML files looks like ( This XML file has been simplified for demo)

<Instrm>
    <Rcrd>
        <FinPpt>
            <Id>BT0007YSAWK</Id>
            <FullNm>Turbo Car</FullNm>
            <Ccy>EUR</Ccy>
            <Cmmdty>false</Cmmdty>
        </FinPpt>
        <Issr>529900M2F7D5795H1A49</Issr>
        <Attrbts>
            <Authrty>US</Authrty>
            <Prd>
                <Dt>2002-03-20</Dt>
            </Prd>
            <Ven>NYSE</Ven>
        </Attrbts >
    </Rcrd>
</Instrm>
<Instrm>
    <Rcrd>
        <FinPpt>
            <Id>BX0009YNOYK</Id>
            <FullNm>Turbo truk</FullNm>
            <Ccy>EUR</Ccy>
            <Cmmdty>false</Cmmdty>
        </FinPpt>
        <Issr>58888M2F7D579536J4</Issr>
        <Attrbts>
            <Authrty>UK</Authrty>
            <Prd>
                <Dt>2002-04-21</Dt>
            </Prd>
            <Ven>BOX</Ven>
        </Attrbts >
    </Rcrd>
</Instrm>
...

I attempted to parse this XML file to a dataframe with attributes to be the column names, like this:

     Id          FullNm     Ccy   Cmmdty           Issr              Authrty      Dt         Ven   
BT0007YSAWK     Turbo Car   EUR   false    529900M2F7D5795H1A49       US       2002-03-20    NYSE
BX0009YNOYK     Turbo truk  EUR   false    58888M2F7D579536J4         UK       2002-04-21    BOX
 .....            ......

but still don't know how after I reviewed some post. All I can do is to extract ID in a list, like

import xml.etree.ElementTree as ET
import pandas as pd
import sys

tree = ET.parse('sample.xml')
root = tree.getroot()
report = root[1][0][0]
records = report.findall('Instrm')
ids = []
for r in records:
    ids.append(r[0][0][0].text)

print(ids[0:100])

out:

[BT0007YSAWK, BX0009YNOYK, …….]

I don't quite understand how to utilize 'nodes' here. Can someone help? Thank you.

Does this answer your question? [How to convert an XML file to nice pandas dataframe?](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) — Trenton McKinney, Sep 18 '20 at 16:43
Posted XML is not quite well-formed. Please include the root tag, especially if it includes namespaces. — Parfait, Sep 18 '20 at 16:50
Why bother to convert in the first place? Just use xml tools to extract data. — Jack Fleeting, Sep 18 '20 at 17:51
@JackFleeting can you show me how to extract FullNm, Ccy, Cmmdty and Issr ? Thank you. — FlyUFalcon, Sep 18 '20 at 17:57

score 0 · Answer 1 · answered Sep 18 '20 at 17:06

Assuming a <root> node in posted XML without namespaces, consider building a dictionary via list/dict comprehension and combining sub dictionaries (available in Python 3.5+) that parse to needed nodes. Then call the DataFrame() constructor on returned list of dictionaries.

data = [{**{el.tag:el.text.strip() for el in r.findall('FinPpt/*')},
         **{el.tag:el.text.strip() for el in r.findall('Issr')},
         **{el.tag:el.text.strip() for el in r.findall('Attrbts/*')},
         **{el.tag:el.text.strip() for el in r.findall('Attrbts/Prd/*')} 
        } for r in root.findall('Instrm/Rcrd')]
    
df = pd.DataFrame(data)

score 0 · Answer 2 · answered Sep 18 '20 at 18:16

To get your target data without converting use an xml parser (like lxml) and xpath.

Something along these lines: [note that you have to wrap you xml with a root element]

string = """
<doc>
[your xml above]
</doc>
"""
from lxml import etree
doc = etree.XML(string)
insts = doc.xpath('//Instrm')
for inst in insts:
    f_nams = inst.xpath('//FullNm')
    ccys = inst.xpath('//Ccy')
    cmds = inst.xpath('//Cmmdty')
    issuers = inst.xpath('//Issr')
for a,b,c,d in zip(f_nams,ccys,cmds,issuers):
    print(a.text,b.text,c.text,d.text)

Output:

Turbo Car EUR false 529900M2F7D5795H1A49
Turbo truk EUR false 58888M2F7D579536J4

Extract variables from XML to Pandas

2 Answers2