Extracting Multilevel xml into pandas dataframe along with node's absolute path as column names

Question

I have a huge xml with multilevel nodes and I am trying to parse the data into pandas dataframe.

<!-- language: lang-xml -->
`<?xml version="1.0" encoding="UTF-8"?>
<TestCodes>
  <TestCode ID="1">
      <Count>
          <Blant>
            <app_1>1</app_1>
            <app_s>2</app_s>
          </Blant>
          </Count>
        <TestCode>

The XML structure is fixed and doesn't change.I am new to python and XML parsing.I have tried to convert above XML to Dictionary but failed.Any help or inputs will be much appreciated.

Does this answer your question? https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe — dheinz, May 15 '20 at 13:11
@dheinz Thank you, But this doesn't apply to my question because in the above link key,web all are attributes of single node document. But mine is multilevel xml with nested child nodes. — Leo, May 15 '20 at 13:16
Hmm, ok, I'm sorry but then I do not know the answer. I know that there is a method for deeply nested JSON files (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) but I cannot figure out a similar function for xml files. — dheinz, May 15 '20 at 13:19
@dheinz Thank you, I dont mind manually specifying the column names in dataframe but i need a way to parse through this XML — Leo, May 15 '20 at 13:31
@sammywemmyThank you, this is just a sample XML i generated which imitates actual XML but structure remains same. I have generated it just to show levels of parsing required — Leo, May 15 '20 at 14:16

sammywemmy · Answer 1 · 2020-05-18T06:35:27.053

u can make use of parsel to extract the data u want; it has a simple syntax and can help with malformed xml structures :

I used xpath syntax to get the data ... have a look at w3schools xpath syntax for guidance.

summary : - if u r referencing a node, u can use / or // depending on the path u wish to take, and for attributes, u affix the @ symbol. To get the text from the path, attach the text() to it and use the getall() method to get all the values, or get() if u r interested in just the first element.

from parsel import Selector

#if you are reading from file : 
with open('data.xml') as xml:
    data = xml.read()

content = Selector(text=data, type="xml")

mapping = {}
mapping["DeptCode"] = content.xpath("//DeptCode/@ID").getall()

mapping["OCC_TotalCount"] = content.xpath("//OCCCounter/TotalCount/text()").getall()

mapping["Test_app_large"] = content.xpath("//Test//app_large//text()").getall()

mapping["Test_app_small"] = content.xpath("//Test//app_small//text()").getall()

print(mapping)

{'DeptCode': ['1', '2'],
 'OCC_TotalCount': ['1', '1'],
 'Test_app_large': ['1', '1'],
 'Test_app_small': ['2', '2']}

#create dataframe
res = pd.DataFrame(mapping)

  DeptCode  OCC_TotalCount  Test_app_large  Test_app_small
0   1              1              1               2
1   2              1              1               2

Thank you for the answer,I tried this solution and here "content = Selector(text=data, type="xml")" for "data=" I cannot copy the entire XML file contents because XML file is huge. Can I use XML file name instead and parse directly from file ? — Leo, May 18 '20 at 05:28
Thanks a lot, I have tried to implement this solution it works fine but is there any error handling i have to care any page related to error handling concepts in selectors would be helpful. — Leo, May 18 '20 at 09:48
Errors ... cant live with them, cant live without them ... i'll suggest u have a look at the [parsel](https://parsel.readthedocs.io/en/latest/index.html) docs for more info... but if u do get any errors, it would have to be when reading in the file, and that means u have to spend some time with the ```requests``` docs. — sammywemmy, May 18 '20 at 09:55

score 0 · Answer 2 · answered May 16 '20 at 00:54

Another method.

from simplified_scrapy import SimplifiedDoc
html = '''Your xml'''
doc = SimplifiedDoc(html)
for dept in doc.selects('DeptCode'):
    print(dept.ID, dept.TotalCount.text, dept.app_large.text, dept.app_small.text)

Result:

1 1 1 2
1 1 1 2

Extracting Multilevel xml into pandas dataframe along with node's absolute path as column names

2 Answers2