1

I am working with a Dataframe that has several columns that are XML strings. The XML string has the following structure.

<row>
     <group>
            <date>2021-12-25</date>  
            <ind1>50</ind1>
            ...
            <indN>10</indN>
     </group>
     ...
     <group>
            <date>2021-12-31</date> 
            <ind1>28</ind1>
            ...
            <indN>13</indN>
     </group>
</row>

The dataframe has the following structure:

id     name     indicators                               
1     John     '<?xml version="1.0"?>\n<row>\n  <group>\n    <date>2021-12-25</date> ...'
1     John     '<?xml version="1.0"?>\n<row>\n  <group>\n    <date>2021-12-31</date> ...'                     
...

I want to explode the XML string so that it creates a column for each indicator. I want something that looks like the Dataframe below.

id     name     date          ind1        ...     indN                               
1     John     2021-12-25     50          ...     10     
1     John     2021-12-31     28          ...     13
...

This is what the dataframe looks like

df.head(2).to_dict()
>> {'KEY': {0: 1,
  1: 2},
 'INDICATORS': {0: '<?xml version="1.0"?>\n<row>\n  <group>\n    <date>2017-12-31</date>\n    <ind1>14</ind1>\n    <ind2>24</ind2>\n  </group>\n  <group>\n    <date>2015-12-31</date>\n    <ind1>10</ind1>\n    <ind2>20</ind2>\n </group>\n  <group>\n    <date>2016-12-31</date>\n    <ind1>12</ind1>\n    <ind2>22</ind2>\n  </group>\n</row>\n',
  1: '<?xml version="1.0"?>\n<row>\n  <group>\n    <date>2017-12-31</date>\n    <ind1>0</ind1>\n    <ind2>0</ind2>\n  </group>\n  <group>\n    <date>2015-12-31</date>\n    <ind1>0</ind1>\n    <ind2>0</ind2>\n  </group>\n  <group>\n    <date>2016-12-31</date>\n    <ind1>0</ind1>\n    <ind2>0</ind2>\n </group>\n</row>\n'}}

Do you have any idea?

confused_pandas
  • 336
  • 5
  • 15
  • 1
    how about `pd.read_xml(data)`, but you need to update pandas [ link](https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.3.0.html#read-and-write-xml-documents) – Naga kiran Aug 09 '21 at 10:12
  • 3
    possible duplicate https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe – Naga kiran Aug 09 '21 at 10:13
  • what did you try? please post a better format of your dataframe, try `df.head(2).to_dict()` and paste the output in your post. – Umar.H Aug 09 '21 at 10:13
  • Can you use `beautifulsoup` for parsing? – Andrej Kesely Aug 09 '21 at 10:36
  • I do not think it is a duplicate. In my case, each row of my dataframe corresponds to an entity which has a specific ```id``` assigned. For each entity / id, I need to create one row for each group of indicators. In what you call a possible duplicate, the XML is a file, not a column in a Dataframe. – confused_pandas Aug 09 '21 at 11:50
  • May I ask you - how did you get into a situation where full xml files are inside cells in a dataframe? Also, can you post a representative sample of one of these files as a standalone string? – Jack Fleeting Aug 09 '21 at 12:07
  • Haha, good question! This is how the dataset is, I didn't transform anything. Basically it's a JSON with nested XML. I agree it's not the right way of structuring the data but I just got to deal with it :) – confused_pandas Aug 09 '21 at 12:15

1 Answers1

0

I believe the best to do it is to extract the xml out of your dataframe, parse it properly, extract the target data and recreate a dataframe. There are a few ways of doing this; one of these involve using the new read_xml() method in pandas, but (as you'll see later), this is a multiindex dataframe, and I didn't manage to get it to work in this specific case. Maybe someone else can... Anyway, the following should get you at least close enough to your actual desired output.

Starting with the output of df.head(2).to_dict():

from lxml import etree
#convert to dictionary (you probably don't need to do it)
target = """your output above, type dict""
rows = []

for val in target[ 'INDICATORS'].values():
    #extract the xml and parse it
    doc = etree.XML(val)
    #your sample xml has 3 groups each of which needs to be handled separately
    groups = doc.xpath('//row//group')
    row = []
    for group in groups:
        row.extend(group.xpath('./*/text()'))
    rows.append(row)

Now that you have the data in a list of lists, it's time to recreate the dataframe:

columns = pd.MultiIndex.from_product([['Group1', 'Group2', 'Group3'], ['Date', 'Ind1', 'Ind2']],
                                     names=['Group', 'data'])
df = pd.DataFrame(rows, columns=columns)
df

Output (pardon the formatting):

Group   Group1                  Group2                  Group3
data    Date       Ind1 Ind2 Date   Ind1    Ind2   Date       Ind1  Ind2
0       2017-12-31  14  24   2015-12-31     10  20  2016-12-31  12  22
1       2017-12-31  0   0    2015-12-31     0   0   2016-12-31  0   0
Jack Fleeting
  • 24,385
  • 6
  • 23
  • 45