1

I have this input.xml example:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <bathrooms>
        <n35237 type="number">1.0</n35237>
        <n32238 type="number">3.0</n32238>
        <n44699 type="number">nan</n44699>
    </bathrooms>
    <price>
        <n35237 type="number">7020000.0</n35237>
        <n32238 type="number">10000000.0</n32238>
        <n44699 type="number">4128000.0</n44699>
    </price>
    <property_id>
        <n35237 type="number">35237.0</n35237>
        <n32238 type="number">32238.0</n32238>
        <n44699 type="number">44699.0</n44699>
    </property_id>
</root>

that I would like to analyse as a dataframe. The code I used for this is below:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('input.xml')
root = tree.getroot()

def f(elem, result):
    result[elem.tag] = elem.text
    cs = list(elem)
    for c in cs:
        result = f(c, result)
    return result

d = f(root, {})
df=pd.DataFrame(d.items())
print(df)

The thing is that the dataframe is nowhere as the xml file. It shows only the last nodes of the xml, because the nodes' names repeat themselves. How can I view all the xml nodes and their correspondent values, without needing to specify the nodes' names? (so that this can be done for any custom xml)

  • Please search [the archives](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) before asking – hd1 May 19 '21 at 08:08
  • Does this answer your question? [How to convert an XML file to nice pandas dataframe?](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) – Jayvee May 19 '21 at 08:11

2 Answers2

0

Please try below steps.

import pandas_read_xml as pdx
df = pdx.read_xml("test.xml", ['first-tag', 'second-tag', 'the-tag-you-want-as-root'])

By default, pandas-read-xml will treat the root tag as being the "rows" of the pandas dataframe. If this is not true, pass the argument root_is_rows=False

*Sometimes, the XML structure is such that pandas will treat rows vs columns in a way that we think are opposites. For these cases, the read_xml() may fail. Try using transpose=True as an argument in such cases. This argument will only affect the reading if root_is_rows=False is passed.

source: https://pypi.org/project/pandas-read-xml/#:~:text=Read%20XML%20as%20pandas%20dataframe,want%20to%20extract%20the%20data.&text=By%20default%2C%20pandas%2Dread%2D,pass%20the%20argument%20root_is_rows%3DFalse%20.

Hamza usman ghani
  • 2,264
  • 5
  • 19
Venkata
  • 11
  • 4
0

Since your intended rows are grandchildren of root, try a nested dictionary comprehension to pass into DataFrame constructor. Below casts node text with float. Remove if content will not be all decimal point numbers.

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('Input.xml')

data = {
    node.tag: {i.tag: float(i.text) for i in node.findall("*")}
    for node in tree.findall("*")
}

property_df = pd.DataFrame(data)
property_df
#        bathrooms       price property_id
# n35237       1.0   7020000.0     35237.0
# n32238       3.0  10000000.0     32238.0
# n44699       NaN   4128000.0     44699.0
Parfait
  • 104,375
  • 17
  • 94
  • 125