2

I want to parse data from an XML file into an multiindex pandas dataframe. My XML File looks like this:

<?xml version="1.0"?>
<catalog>
   <book name="Documents/Books/German">
      <author>Kerstin Gier</author>
      <title>Rubinrot</title>
   </book>
   <book name="Documents/Articles/English">
      <author>Kim Ralls</author>
      <title>Midnight Rain</title>
   </book>
   <book name="Documents/Books/English">
      <author>Eva Corets</author>
      <title>Maeve Ascendant</title>
   </book>
   <book name="Documents/Books/English">
      <author>Karl Parker</author>
      <title>Worldeater</title>
   </book>
</catalog>

The goal is to store the data from all book tags into an multiindex pandas dataframe which should look like follows:

                              author        title   
Documents  Books     German   Kerstin Gier  Rubinrot        
                     English  Eva Corets    Maeve Ascendant
                              Karl Parker   Worldeater
           Articles  German   Null          Null
                     English  Kim Ralls     Midnight Rain

The index of the multiindex dataframe should be the paths which the attribute "name" contains. I don't want to hardcode any paths because my real world example has many different paths and the multiindex dataframe will have 5-6 dimensions.

My approach so far: I started to create a single index dataframe which looks like so

path                        author        title 
Documents/Books/German      Kerstin Gier  Rubinrot
Documents/Articles/English  Kim Ralls     Midnight Rain
Documents/Books/English     Eva Corets    Maeve Ascendant
Documents/Books/English     Karl Parker   Worldeater

The question is: how can I convert the dataframe into a multiindex dataframe with the path structure as indexes? The problem I see is to change the indexes without loosing the binding to the data.

Sley
  • 55
  • 4
  • 1
    Possible duplicate of [How to convert an XML file to nice pandas dataframe?](https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe) – G. Anderson Mar 26 '19 at 20:37
  • No I need an solution that generates an multiindex dataframe. That's (as far as I noticed) more complicated. Thanks for the advise though. It's a good backup solution. – Sley Mar 27 '19 at 15:10
  • Is it not an option to parse it this way, then create the multiindex after the fact? – G. Anderson Mar 27 '19 at 15:21
  • This might be an option. I edited my question. Thanks for your help so far! – Sley Mar 27 '19 at 15:57
  • Answered based on edit – G. Anderson Mar 27 '19 at 16:23

1 Answers1

2

Based on you original question and edits, here's the follow up solution. Using the parsing from here, and the ideas from here:

We take the existing path column, split it on / and convert it to a list, then use those list values to create new columns.

Then we use those columns as the new index.

df

    path                    author          title
0   Documents/Books/German  Kerstin_Gier    Rubinrot
1   Documents/Articles/English  Kim_Ralls   Midnight_Rain
2   Documents/Books/English Eva_Corets  Maeve_Ascendant
3   Documents/Books/English Karl_Parker Worldeater

df[['cat','type','lang']]=pd.DataFrame(df['path'].str.split('/').values.tolist(), index=df.index)

df

    path                    author          title       cat         type    lang
0   Documents/Books/German  Kerstin_Gier    Rubinrot    Documents   Books   German
1   Documents/Articles/English  Kim_Ralls   Midnight_Rain   Documents   Articles    English
2   Documents/Books/English Eva_Corets  Maeve_Ascendant Documents   Books   English
3   Documents/Books/English Karl_Parker Worldeater  Documents   Books   English

df.set_index(['cat','type','lang'])

                                path                    author          title
cat         type        lang            
Documents   Books       German  Documents/Books/German  Kerstin_Gier    Rubinrot
            Articles    English Documents/Articles/English  Kim_Ralls   Midnight_Rain
            Books       English Documents/Books/English Eva_Corets  Maeve_Ascendant
                        English     Documents/Books/English Karl_Parker Worldeater

From there, obviously you can drop the path if desired

G. Anderson
  • 5,815
  • 2
  • 14
  • 21
  • No problem, happy to help out! – G. Anderson Mar 27 '19 at 17:22
  • @G.Anderson - can you explain why when `cat` was a column, it had 4 entries - all `Documents`; once `set_index` made it an index, it has only one, with the other three left blank. Similarly for the `type` column, which - as an index - has one blank. – Jack Fleeting Mar 27 '19 at 20:51
  • 1
    @JackFleeting that's the default display for a multi index in pandas. Because the other values are just repeats of the initial value, it doesn't display them – G. Anderson Mar 27 '19 at 20:54
  • 1
    @JackFleeting [this behavior can be changed](https://stackoverflow.com/questions/38211350/showing-all-index-values-when-using-multiindexing-in-pandas) – G. Anderson Mar 27 '19 at 20:58
  • @G.Anderson - makes sense, but in `lang` as a column, `English` is repeated 3 times in succession; in `type`, `Books` is repeated once, though not in succession. – Jack Fleeting Mar 27 '19 at 20:59
  • 1
    @JackFleeting this, honestly, just _is_ how it displays. The default for a single index (or, in this case, the last layer of a multiindex) is to show all values. If you check the values in the 2nd and 3rd index, you see that they're repeated because they're all different combinations. `[(Books, German),(Articles, English),(Books, English),(Books, English)]` – G. Anderson Mar 27 '19 at 21:05
  • (except the last, where `books` isn't repeated) – G. Anderson Mar 27 '19 at 21:05