
My hypothetical dataframe is

df = pd.DataFrame({'col1':[91,91,91,91,92,92,92,92],
                   'col2':[91,92]*4,'value':[10]*8})
df

   col1  col2  value
0    91    91     10
1    91    92     10
2    91    91     10
3    91    92     10
4    92    91     10
5    92    92     10
6    92    91     10
7    92    92     10

Grouping over the two columns produces those groups:

grouped = df.groupby(['col1','col2'])
grouped.groups
{(91, 91): Int64Index([0, 2], dtype='int64'),
 (91, 92): Int64Index([1, 3], dtype='int64'),
 (92, 91): Int64Index([4, 6], dtype='int64'),
 (92, 92): Int64Index([5, 7], dtype='int64')}

I want to expand this set of groups, so that I can aggregate over the extended selection of groups.
Let’s say I want to add groups produced by

groupedall = df.groupby(['col1'])
groupedall.groups
{91: Int64Index([0, 1, 2, 3], dtype='int64'),
 92: Int64Index([4, 5, 6, 7], dtype='int64')}

This is what I try: I substitute 99 for the col2 value (where 99 means ‘any’),

groupedall.groups[(91, 99)] = groupedall.groups.pop(91)
groupedall.groups[(92, 99)] = groupedall.groups.pop(92)

And then add those new groups to my original group dict.

grouped.groups.update(groupedall.groups)
grouped.groups
{(91, 91): Int64Index([0, 2], dtype='int64'),
 (91, 92): Int64Index([1, 3], dtype='int64'),
 (91, 99): Int64Index([0, 1, 2, 3], dtype='int64'),
 (92, 91): Int64Index([4, 6], dtype='int64'),
 (92, 92): Int64Index([5, 7], dtype='int64'),
 (92, 99): Int64Index([4, 5, 6, 7], dtype='int64')}

But when I aggregate over the grouped object, those newly added groups are omitted.

grouped.sum()
           value
col1 col2       
91   91       20
     92       20
92   91       20
     92       20
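A minimal sketch of what seems to be happening: `.groups` appears to be only a derived view, and the aggregation reads the internal grouper rather than that dict, so mutating the dict has no effect on the result:

```python
import pandas as pd

df = pd.DataFrame({'col1': [91]*4 + [92]*4,
                   'col2': [91, 92]*4,
                   'value': [10]*8})
grouped = df.groupby(['col1', 'col2'])

# Mutate the dict returned by .groups ...
grouped.groups[(91, 99)] = grouped.groups[(91, 91)]

# ... but the aggregation is driven by the internal grouper,
# so only the original four groups appear in the output.
result = grouped.sum()
```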

And I want the output to include groups that I’ve just added:

           value
col1 col2       
91   91       20
     92       20
     99       40
92   91       20
     92       20
     99       40

What am I missing here?

Pav El

2 Answers


Option 1

df.append(df.assign(col2=99)).groupby(['col1', 'col2']).sum()

           value
col1 col2       
91   91       20
     92       20
     99       40
92   91       20
     92       20
     99       40
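`DataFrame.append` was removed in pandas 2.0, so on current versions the same idea reads:

```python
import pandas as pd

df = pd.DataFrame({'col1': [91]*4 + [92]*4,
                   'col2': [91, 92]*4,
                   'value': [10]*8})

# Stack df on a copy whose col2 is overwritten with the 'any' marker 99,
# then group once over the doubled frame.
out = pd.concat([df, df.assign(col2=99)]).groupby(['col1', 'col2']).sum()
```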

Option 2

dummy_series = pd.Series(99, df.index, name='col2')

pd.concat([
    df.groupby(['col1', 'col2']).sum(),
    df.groupby(['col1', dummy_series])[['value']].sum()
]).sort_index()

           value
col1 col2       
91   91       20
     92       20
     99       40
92   91       20
     92       20
     99       40
piRSquared
  • Can you bring the 'dummy series' option back? I want to look at it a bit longer – Pav El Aug 31 '17 at 13:02
  • Thanks. This is a very useful suggestion. The reason why I asked about the dummy series is that my data set is over 20K rows, with six cols to aggregate over, so stacking it all on top of itself multiple times may prove not as efficient as concatenating multiple results; I want to experiment with both ideas. Thank you. – Pav El Aug 31 '17 at 13:09

The key thing here seems to be that you want to manually add a group to a `DataFrameGroupBy` object.

This appears to work when you look at `grouped.groups`, but any other attribute of `grouped` makes it clear the new group isn't actually being considered as a group.

It doesn't appear to be possible to modify a `DataFrameGroupBy` in this way, but using the link provided by @QuickBeam2k1, you can get the data you need by doing:

df.pivot_table(
    index='col1',
    columns='col2',
    values='value',
    aggfunc='sum',
    margins=True
)

which returns:

col2     91     92    All
col1                     
91     20.0   20.0   40.0
92     20.0   20.0   40.0
All    40.0   40.0   80.0
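If you want the margins in the long (col1, col2) layout of the question, `margins_name` plus `stack` gets there from the pivot. A sketch, using the label 'any' in place of 99, since `margins_name` takes a string:

```python
import pandas as pd

df = pd.DataFrame({'col1': [91]*4 + [92]*4,
                   'col2': [91, 92]*4,
                   'value': [10]*8})

# margins_name relabels the 'All' margin; drop the row margin and stack
# to return to the long (col1, col2) layout.
wide = df.pivot_table(index='col1', columns='col2', values='value',
                      aggfunc='sum', margins=True, margins_name='any')
long_form = wide.drop('any').stack()
```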
greg_data
  • The pivot table approach works fine for two-column aggregation, and is a little trickier with three columns. The problem that I face is that I have 6 columns to aggregate over, with all sorts of intermediate ‘any’ values, not to mention additional subcategories. For instance, think of some data on students that you have to aggregate over major, age, sex, etc., and then for all those who have declared a major (excluding undecided). So I was leaning more towards doing multiple group-bys and merging the results. – Pav El Aug 31 '17 at 11:46
  • Fair enough. Hope you're able to find some solution for aggregating over so many columns and keeping it looking tidy! – greg_data Aug 31 '17 at 12:27