3

I'm trying to create a script to convert nested XML files to a Pandas dataframe. I've found this article https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c, which does a good job getting to the second level (parent, child) but I don't know neither how to get to deeper levels (e.g. grandchildren), nor to get to the attributes of the children (e.g. "neighbor" -> "name").

Here is my XML structure:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
            <neighbor2 name="Italy" direction="S"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

and here is my code:

import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols): 

    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []

    for node in xroot: 
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})

    out_df = pd.DataFrame(rows, columns=df_cols)

    return out_df

xml_file= "example.xml"
df_cols = ["name","year","direction"]

out_df=parse_XML(xml_file, df_cols)
out_df

What I'd like to obtain is a structure like the following:

| name          | year | neighbor name 1 | neighbor direction 1 | neighbor2 name 1 |
|---------------|------|-----------------|----------------------|------------------|
| Liechtenstein | 2008 | Austria         | E                    | Italy            |
|               |      |                 |                      |                  |
|               |      |                 |                      |                  |

The structure needs to be as flexible as possible, so that it would require little editing to be used with different files. I'm getting XML files with different data structures, so I'd like to be able to do some minimal editing every time.

Thanks a lot!!

user3119334
  • 41
  • 1
  • 3

2 Answers2

9

I've made a package for similar use case. It could work here too.

pip install pandas_read_xml

you can do something like

import pandas_read_xml as pdx

df = pdx.read_xml('filename.xml', ['data'])

To flatten, you could

df = pdx.flatten(df)

or

df = pdx.fully_flatten(df)
min
  • 231
  • 2
  • 6
6

You'll need a recursive function to flatten rows, and a mechanism for dealing with duplicate data.

This is messy and depending on the data and nesting, you may end up with rather strange dataframes.

import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd


def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """

    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text

    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value

    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))


def dictify_key_pairs(pairs, key_sep='-'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}

    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)

    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value

    return out


# Parse XML with etree
tree = et.XML("""<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
        <neighbor2 name="Italy" direction="S"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
        <cities>
            <city name="Chargin" population="1234" />
            <city name="Firin" population="4567" />
        </cities>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
""")

# Generate flat rows out of the root nodes in the tree
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)
print(df)

outputs

            name rank  year   gdppc neighbor-name-1 neighbor-name-2 neighbor-direction-1 neighbor-direction-2 neighbor2-name neighbor2-direction neighbor-name neighbor-direction cities-city-name-1 cities-city-name-2 cities-city-population-1 cities-city-population-2
0  Liechtenstein    1  2008  141100         Austria     Switzerland                    E                    W          Italy                   S           NaN                NaN                NaN                NaN                      NaN                      NaN
1      Singapore    4  2011   59900             NaN             NaN                  NaN                  NaN            NaN                 NaN      Malaysia                  N            Chargin              Firin                     1234                     4567
2         Panama   68  2011   13600      Costa Rica        Colombia                    W                    E            NaN                 NaN           NaN                NaN                NaN                NaN                      NaN                      NaN
AKX
  • 152,115
  • 15
  • 115
  • 172
  • That's amazing, thanks! Now I'd just like to parse the XML file instead of having to copy and paste the XML text into the code. I tried substituting 'tree=et.XML ()' with 'tree=et.parse('example.xml') but I get the following error 'ElementTree' object is not iterable. What should I do differently? – user3119334 Feb 13 '20 at 11:27
  • `tree = et.parse(...).getroot()` – AKX Feb 13 '20 at 14:58