3

I have an Excel file with some (mostly) nicely grouped rows. I built a fake example below.

Is there a way to get read_excel in Pandas to produce a multiindex preserving this structure?

investment sample

For this example the MultiIndex would have four levels (Family, Individual, Child (optional), investment). If the subtotal values were lost that would be fine as they can easily be recreated in Pandas.

rhaskett
  • 1,864
  • 3
  • 29
  • 48
  • Can you give a specific example with data? In this instance, a screenshot is useful. For example, include an image of a small table with groupings and what you expect the `pandas` dataframe to look like. – jpp Apr 25 '18 at 09:20
  • 1
    For this example I would say: no, not with `read_excel`. Pandas doesn't have any sense of the "groupings" you've designated in Excel. (i.e. via Data > Group.) – Brad Solomon Apr 25 '18 at 16:34
  • This is what I thought, but if someway existed it would make my life much easier. – rhaskett Apr 25 '18 at 16:38

3 Answers3

2

No, pandas can't read such a structure.

An alternative solution is to use pandas to read your data, but transform this into an easily accessible dictionary, rather than keeping your data in a dataframe with MultiIndex.

There are 2 sensible requirements to make your data more usable:

  1. Make your investment fund names unique. This is trivial.
  2. Convert your Excel grouping to an additional column which indicates the parent of the row.

In the below example, these 2 requirements are assumed.

Setup

from collections import defaultdict
from functools import reduce
import operator
import pandas as pd

df = pd.DataFrame({'name': ['Simpson Family', 'Marge Simpson', 'Maggies College Fund',
                            'MCF Investment 2', 'MS Investment 1', 'MS Investment 2', 'MS Investment 3',
                            'Homer Simpson', 'HS Investment 1', 'HS Investment 3', 'HS Investment 2',
                            'Griffin Family', 'Lois Griffin', 'LG Investment 2', 'LG Investment 3',
                            'Brian Giffin', 'BG Investment 3'],
                   'Value': [600, 450, 100, 100, 100, 200, 50, 150, 100, 50, 0, 200, 150, 100, 50, 50, 50],
                   'parent': ['Families', 'Simpson Family', 'Marge Simpson', 'Maggies College Fund',
                              'Marge Simpson', 'Marge Simpson', 'Marge Simpson', 'Simpson Family',
                              'Homer Simpson', 'Homer Simpson', 'Homer Simpson', 'Families',
                              'Griffin Family', 'Lois Griffin', 'Lois Griffin', 'Griffin Family',
                              'Brian Giffin']})

    Value                  name                parent  
0     600        Simpson Family              Families   
1     450         Marge Simpson        Simpson Family   
2     100  Maggies College Fund         Marge Simpson   
3     100      MCF Investment 2  Maggies College Fund   
4     100       MS Investment 1         Marge Simpson   
5     200       MS Investment 2         Marge Simpson   
6      50       MS Investment 3         Marge Simpson   
7     150         Homer Simpson        Simpson Family   
8     100       HS Investment 1         Homer Simpson   
9      50       HS Investment 3         Homer Simpson   
10      0       HS Investment 2         Homer Simpson   
11    200        Griffin Family              Families   
12    150          Lois Griffin        Griffin Family   
13    100       LG Investment 2          Lois Griffin   
14     50       LG Investment 3          Lois Griffin   
15     50          Brian Giffin        Griffin Family   
16     50       BG Investment 3          Brian Giffin

Step 1

Define a child -> parent dictionary and some utility functions:

child_parent_dict = df.set_index('name')['parent'].to_dict()

tree = lambda: defaultdict(tree)

d = tree()

def get_all_parents(child):

    """Get all parents from hierarchy structure"""

    while child != 'Families':
        child = child_parent_dict[child]
        if child != 'Families':
            yield child

def getFromDict(dataDict, mapList):

    """Iterate nested dictionary"""

    return reduce(operator.getitem, mapList, dataDict)

def default_to_regular_dict(d):

    """Convert nested defaultdict to regular dict of dicts."""

    if isinstance(d, defaultdict):
        d = {k: default_to_regular_dict(v) for k, v in d.items()}
    return d

Step 2

Apply this to your dataframe. Use it to create a nested dictionary structure which will be more efficient for repeated queries.

df['structure'] = df['name'].apply(lambda x: ['Families'] + list(get_all_parents(x))[::-1])

for idx, row in df.iterrows():
    getFromDict(d, row['structure'])[row['name']]['Value'] = row['Value']

res = default_to_regular_dict(d)

Result

Dataframe

    Value                  name                parent  \
0     600        Simpson Family              Families   
1     450         Marge Simpson        Simpson Family   
2     100  Maggies College Fund         Marge Simpson   
3     100      MCF Investment 2  Maggies College Fund   
4     100       MS Investment 1         Marge Simpson   
5     200       MS Investment 2         Marge Simpson   
6      50       MS Investment 3         Marge Simpson   
7     150         Homer Simpson        Simpson Family   
8     100       HS Investment 1         Homer Simpson   
9      50       HS Investment 3         Homer Simpson   
10      0       HS Investment 2         Homer Simpson   
11    200        Griffin Family              Families   
12    150          Lois Griffin        Griffin Family   
13    100       LG Investment 2          Lois Griffin   
14     50       LG Investment 3          Lois Griffin   
15     50          Brian Giffin        Griffin Family   
16     50       BG Investment 3          Brian Giffin   

                                            structure  
0                                          [Families]  
1                          [Families, Simpson Family]  
2           [Families, Simpson Family, Marge Simpson]  
3   [Families, Simpson Family, Marge Simpson, Magg...  
4           [Families, Simpson Family, Marge Simpson]  
5           [Families, Simpson Family, Marge Simpson]  
6           [Families, Simpson Family, Marge Simpson]  
7                          [Families, Simpson Family]  
8           [Families, Simpson Family, Homer Simpson]  
9           [Families, Simpson Family, Homer Simpson]  
10          [Families, Simpson Family, Homer Simpson]  
11                                         [Families]  
12                         [Families, Griffin Family]  
13           [Families, Griffin Family, Lois Griffin]  
14           [Families, Griffin Family, Lois Griffin]  
15                         [Families, Griffin Family]  
16           [Families, Griffin Family, Brian Giffin]

Dictionary

{'Families': {'Griffin Family': {'Brian Giffin': {'BG Investment 3': {'Value': 50},
                                                  'Value': 50},
                                 'Lois Griffin': {'LG Investment 2': {'Value': 100}, 'LG Investment 3': {'Value': 50},
                                                  'Value': 150},
                                 'Value': 200},
              'Simpson Family': {'Homer Simpson': {'HS Investment 1': {'Value': 100}, 'HS Investment 2': {'Value': 0}, 'HS Investment 3': {'Value': 50},
                                                   'Value': 150},
                                 'Marge Simpson': {'MS Investment 1': {'Value': 100}, 'MS Investment 2': {'Value': 200}, 'MS Investment 3': {'Value': 50},
                                                   'Maggies College Fund': {'MCF Investment 2': {'Value': 100},
                                                                            'Value': 100},
                                                   'Value': 450},
              'Value': 600}}}
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 1
    Yeah, I ended up needing to define this myself in an additional column in a similar manner. Luckily, I didn't have to do this manually a the file is rather large. I used the values to "roll back" and determine the parent structure (ie "Investment 2" must be part of MCF but "Investment 1" must be part of MS not MCF because MCF has a total of 100 not 200). I needed a list of "investments" in another file in order to do this. – rhaskett Apr 30 '18 at 19:16
1

I don't think it is possible to implement this using read_excel as-it.

What you can do is to add additional columns to your excel sheet based on the four hierarchy levels (Family, Individual, Child (optional), investment) and then use read_excel() with index_col[0,1,2,3] to generate the pandas dataframe.

dportman
  • 1,101
  • 10
  • 20
  • Yeah unfortunately the actual document is some 10000 lines generated by an external source. I'm currently trying to figure out a way to back out the group structure by exploiting the value column. – rhaskett Apr 25 '18 at 17:37
0

See the index_col parameter of the read_excel function.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

index_col : int, list of ints, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

max
  • 4,141
  • 5
  • 26
  • 55
  • thanks Max but the issue is the index groupings are not in multiple columns from my data source. I expanded the example to make this more clear. – rhaskett Apr 25 '18 at 16:04