Extracting nested XML elements of different sizes into Pandas

Question

Lets assume we have an arbitrary XML document like below

<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://something.org/schema/s/program">
   <program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
            xsi:schemaLocation="http://something.org/schema/s/program  http://something.org/schema/s/program.xsd">
      <orgUnitId>Organization 1</orgUnitId>
      <requiredLevel>academic bachelor</requiredLevel>
      <requiredLevel>academic master</requiredLevel>
      <programDescriptionText xml:lang="nl">Here is some text; blablabla</programDescriptionText>
      <searchword xml:lang="nl">Scrum master</searchword>
   </program>
   <program xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
            xsi:schemaLocation="http://something.org/schema/s/program  http://something.org/schema/s/program.xsd">
      <requiredLevel>bachelor</requiredLevel>
      <requiredLevel>academic master</requiredLevel>
      <requiredLevel>academic bachelor</requiredLevel>
      <orgUnitId>Organization 2</orgUnitId>
      <programDescriptionText xml:lang="nl">Text from another organization about some stuff.</programDescriptionText>
      <searchword xml:lang="nl">Excutives</searchword>
   </program>
   <program xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <orgUnitId>Organization 3</orgUnitId>
      <programDescriptionText xml:lang="nl">Also another huge text description from another organization.</programDescriptionText>
      <searchword xml:lang="nl">Negotiating</searchword>
      <searchword xml:lang="nl">Effective leadership</searchword>
      <searchword xml:lang="nl">negotiating techniques</searchword>
      <searchword xml:lang="nl">leadership</searchword>
      <searchword xml:lang="nl">strategic planning</searchword>
   </program>
</programs>

Currently I'm looping over the elements I need by using their absolute paths, since I'm not able to use any of the get or find methods in ElementTree. As such, my code looks like below:

import pandas as pd
import xml.etree.ElementTree as ET   
import numpy as np
import itertools

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag

dfcols=['organization','description','level','keyword']
organization=[]
description=[]
level=[]
keyword=[]

for node in root:
    for child in 
       node.findall('.//{http://something.org/schema/s/program}orgUnitId'):
        organization.append(child.text) 
    for child in node.findall('.//{http://something.org/schema/s/program}programDescriptionText'):
        description.append(child.text) 
    for child in node.findall('.//{http://something.org/schema/s/program}requiredLevel'):
        level.append(child.text)
    for child in node.findall('.//{http://something.org/schema/s/program}searchword'):
        keyword.append(child.text)

The goal, of course, is to create one dataframe. However, since each node in the XML file contains one or multiple elements, such as requiredLevel or searchword I'm currently losing data when I'm casting it to a dataframe by either:

df=pd.DataFrame(list(itertools.zip_longest(organization,
    description,level,searchword,
    fillvalue=np.nan)),columns=dfcols)

or using pd.Series as given here or another solution which I don't seem to get it fit from here

My best bet is not to use Lists at all, since they don't seem to index the data correctly. That is, I lose data from the 2nd to Xth child node. But right now I'm stuck, and don't see any other options.

What my end result should look like is this:

organization    description  level                keyword
Organization 1  ....         academic bachelor,   Scrum master
                             academic master 
Organization 2  ....         bachelor,            Executives
                             academic master, 
                             academic bachelor    
Organization 3  ....                              Negotiating,
                                                  Effective leadership,
                                                  negotiating techniques,
                                                  ....

score 2 · Accepted Answer · answered Apr 24 '19 at 16:13

Consider building a list of dictionaries with comma-collapsed text values. Then pass list into the pandas.DataFrame constructor:

dicts = []
for node in root:
    orgs = ", ".join([org.text for org in node.findall('.//{http://something.org/schema/s/program}orgUnitId')])
    desc = ", ".join([desc.text for desc in node.findall('.//{http://something.org/schema/s/program}programDescriptionText')])
    lvls = ", ".join([lvl.text for lvl in node.findall('.//{http://something.org/schema/s/program}requiredLevel')])
    wrds = ", ".join([wrd.text for wrd in node.findall('.//{http://something.org/schema/s/program}searchword')])

    dicts.append({'organization': orgs, 'description': desc, 'level': lvls, 'keyword': wrds})

final_df = pd.DataFrame(dicts, columns=['organization','description','level','keyword'])

Output

print(final_df)
#      organization                                        description                                         level                                            keyword
# 0  Organization 1                       Here is some text; blablabla            academic bachelor, academic master                                       Scrum master
# 1  Organization 2   Text from another organization about some stuff.  bachelor, academic master, academic bachelor                                          Excutives
# 2  Organization 3  Also another huge text description from anothe...                                                Negotiating, Effective leadership, negotiating...

Although both came with a possible solution to my answer, I've got to admit that the last one works in my case. With the first answer I kept running into several errors within the function itself. The last accepted answer as a solution works neatly. There is, however, one small catch which is fixed easily: if the data has attributes of `NoneType` and throws an error a line could be changed to; `desc = ", ".join([str(desc.text) for desc in node.findall('.//{xml_path}Element')])` Thanks for the support both of you — Wokkel, Apr 25 '19 at 08:26

JoergVanAken · Answer 2 · 2019-04-24T14:48:48.600

A lightweight xml_to_dict converter can be found here. It can be improved by this to handle namespaces.

def xml_to_dict(xml='', remove_namespace=True):
    """Converts an XML string into a dict

    Args:
        xml: The XML as string
        remove_namespace: True (default) if namespaces are to be removed

    Returns:
        The XML string as dict

    Examples:
        >>> xml_to_dict('<text><para>hello world</para></text>')
        {'text': {'para': 'hello world'}}

    """
    def _xml_remove_namespace(buf):
        # Reference: https://stackoverflow.com/a/25920989/1498199
        it = ElementTree.iterparse(buf)
        for _, el in it:
            if '}' in el.tag:
                el.tag = el.tag.split('}', 1)[1]
        return it.root

    def _xml_to_dict(t):
        # Reference: https://stackoverflow.com/a/10077069/1498199
        from collections import defaultdict

        d = {t.tag: {} if t.attrib else None}
        children = list(t)
        if children:
            dd = defaultdict(list)
            for dc in map(_xml_to_dict, children):
                for k, v in dc.items():
                    dd[k].append(v)
            d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in dd.items()}}

        if t.attrib:
            d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())

        if t.text:
            text = t.text.strip()
            if children or t.attrib:
                if text:
                    d[t.tag]['#text'] = text
            else:
                d[t.tag] = text

        return d

    buffer = io.StringIO(xml.strip())
    if remove_namespace:
        root = _xml_remove_namespace(buffer)
    else:
        root = ElementTree.parse(buffer).getroot()

    return _xml_to_dict(root)

So let s be the string which holds your xml. We can convert it to a dict via

d = xml_to_dict(s, remove_namespace=True)

Now the solution is straight forward:

rows = []
for program in d['programs']['program']:
    cols = []
    cols.append(program['orgUnitId'])
    cols.append(program['programDescriptionText']['#text'])
    try:
        cols.append(','.join(program['requiredLevel']))
    except KeyError:
        cols.append('')

    try:
         searchwords = program['searchword']['#text']
    except TypeError:
         searchwords = []
         for searchword in program['searchword']:
            searchwords.append(searchword['#text'])
         searchwords = ','.join(searchwords)
    cols.append(searchwords)

    rows.append(cols)

df = pd.DataFrame(rows, columns=['organization', 'description', 'level', 'keyword'])

After looking some time at the code I can't seem to get a grip on it. So I'm wondering what kind of output Python throws at you from the `df` . In my case I'm running stuck on several name and attribute errors. In addition, following the first link you provided I was able to pass the xml into a dictionary. However, running the `for loop` provided does not work. Thanks for your comment, will check it out. Wasn't aware of importing another module. — Wokkel, Apr 24 '19 at 14:45
You don't have to follow the link, I posted the implementation of `xml_to_dict`, too. I just wanted to make clear, that it's not "my" code. — JoergVanAken, Apr 24 '19 at 14:50

Extracting nested XML elements of different sizes into Pandas

2 Answers2