0

I have a pivot_table generated DataFrame with a single index for its rows, and a MultiIndex for its columns. The top level of the MultiIndex is the name of the data I am running calculations on, and the second level is the DATE of that data. The values are the result of those calculations. It looks like this:

Imgur link - my reputation not high enough to post inline images

I am trying to group this data by quarters (Q42018, for example), instead of every single day (the native format of the data).

I found this post that uses PeriodIndex and GroupBy to convert an index of dates into an index of quarters/years to be quite elegant and make the most sense.

The problem is that this solution is for a dataframe with only single index columns. I'm running into a problem trying to do this because my columns are a multi-index, and I can't figure out how to get it to work. Here is my attempt thus far:

bt = cleaned2018_df.pivot_table(index='Broker',
                                values=['Interaction Id','Net Points'],
                                columns='Date',
                                aggfunc={'Interaction Id':pd.Series.nunique,
                                         'Net Points':np.sum}, 
                                fill_value=0)

pidx = pd.PeriodIndex(bt.columns.levels[1], freq='Q')
broker_qtr_totals = bt.groupby(pidx, axis=1, level=1).sum()

As you can see, I'm grabbing the second level of the MultiIndex that contains all the dates, and running it through the PeriodIndex function to get back an index of quarters. I then pass that PeriodIndex into groupby, and tell it to operate on columns and the second level where the dates are.

This returns a ValueError response of Grouper and axis must be same length. And I know the reason is because the pidx value I'm passing in to the GroupBy is of length x, whereas the column axis of the dataframe is length 2x (since the 1st level of the multiindex has 2 values).

I'm just getting hung up on how to properly apply this to the entire index. I can't seem to figure it out syntactically, so I wanted to rely on the community's expertise to see if someone could help me out.

If my explanation is not clear, I'm happy to clarify further. Thank you in advance.

riptusk331
  • 369
  • 4
  • 9

1 Answers1

0

I figured this out, and am going to post the answer in case anyone else with a similar problem lands here. I was thinking about the problem correctly, but had a few errors in my first attempt.

The length error was due to me passing an explicit reference to the 2nd level of the MultiIndex into the PeriodIndex function, and then passing that into groupby. The better solution is to use the .get_level_values function, as this takes into account the multi-level nature of the index and returns the appropriate # of values based on how many items are in higher levels.

For instance - if you have a DataFrame with MultiIndex columns with 2 levels - and those 2 levels each contain 3 values, your table will have 9 columns, as the lower level is broken out for each value in the top level. My initial solution was just grabbing those 3 values from the second level directly, instead of all 9. get_level_values corrects for this.

The second issue was that I was passing just this PeriodIndex object by itself into the groupby. That will work, but then it basically just disregards the top level of the MultiIndex. So you need to make sure to pass in a list that contains the original top level, and your new second level that you want to group by.

Corrected code:

#use get_level_values instead of accessing levels directly
pIdx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')

# to maintain original grouping, pass in a list of your original top level, 
# and the new second level
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
                               axis=1).sum()

This works

imgur link to dataframe image as my rep is too low

riptusk331
  • 369
  • 4
  • 9