import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [7, 8, 9, 10, 11, 12], 'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

The indices in i come from a larger dataframe, and df is a subset of that larger dataframe, so some of the indices in i will not be present in df. When I do

df.groupby('z').aggregate({'y': lambda x: sum(x.loc[i])})  # I know I can just use .aggregate({'y': sum}), this is just an example to illustrate my problem

I get this output

   y
z    
a NaN
b NaN

as well as a warning message

__main__:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
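
To see what the warning is pointing at, here is a minimal sketch (outside the groupby, using the df and i defined above) of the difference between .loc with missing labels and .reindex; note that in newer pandas versions .loc with any missing label raises a KeyError instead of only warning:

# .loc with labels that are partly missing from df is deprecated / an error:
# df.loc[i]                  # 10 and 20 are not in df.index
df.reindex(i)                # rows for 0, 3, 5 plus all-NaN rows for 10 and 20
df.reindex(i)['y'].sum()     # NaN is skipped by sum: 7 + 10 + 12 = 29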

How can I avoid this warning message and get the correct output? In my example the only labels from i that are present in df are [0, 3, 5], so the expected output is:

    y
z
a   7   # "sum" of the single value at index 0
b  22   # sum of the values at indices 3 and 5

EDIT

The answers here work great, but they do not allow different types of aggregation for the x and y columns. For example, let's say I want to sum all elements of x, but for y sum only the elements at the indices in i:

df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[i])})

This is the desired output:

    y   x
z
a   7   6
b  22  15
HappyPy

2 Answers


Edit for updated question:

df.groupby('z').agg({'x':'sum','y':lambda r: r.reindex(i).sum()})

Output:

    x   y
z        
a   6   7
b  15  22
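
For clarity, a small sketch of what the lambda in that agg does for a single group; the intermediate variable r here is only for illustration and is not part of the answer:

# Inside agg, the lambda receives the 'y' Series of one group, e.g. group 'b':
r = df.loc[df['z'] == 'b', 'y']   # y values of group 'b', index [3, 4, 5]
r.reindex(i)                      # keeps labels 3 and 5, everything else NaN
r.reindex(i).sum()                # 10 + 12 = 22, NaN values are skipped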

Use reindex to select only the indices in i, then dropna to remove the all-NaN rows created because some indices in i aren't in df. Then groupby and agg:

df.reindex(i).dropna(how='all').groupby('z').agg({'y':'sum'})

Or, since groupby ignores NaN group keys, you don't really need the dropna:

df.reindex(i).groupby('z').agg({'y':'sum'})

Output:

      y
z      
a   7.0
b  22.0
Scott Boston
  • This is great, thanks, but I don't really understand why this works. `df.reindex(i)` adds a row with `10 NaN NaN NaN`, so why doesn't the aggregation return `NaN` as in my original problem? – HappyPy Feb 27 '19 at 14:25
  • Yes, groupby doesn't group on NaN value groups. However, you could use dropna with how='all' to remove those NaN records as I have shown in the first statement. – Scott Boston Feb 27 '19 at 14:28
  • See SO Post https://stackoverflow.com/a/18431417/6361531 about missing values in groupby. – Scott Boston Feb 27 '19 at 14:29
  • Thank you for the link. This solution works well, but it won't work for the edited problem in my question... Is there a workaround? – HappyPy Feb 27 '19 at 14:38
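
A minimal sketch of the behaviour discussed in the comments above: rows whose group key ('z' here) is NaN are simply excluded by groupby, which is why the all-NaN rows introduced by reindex never produce a NaN group in the result:

# Rows 10 and 20 get NaN in every column, including the grouping column 'z'.
tmp = df.reindex(i)
print(tmp)
print(tmp.groupby('z').agg({'y': 'sum'}))   # only groups 'a' and 'b' remain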

Use the intersection of df.index and i to get only the matching labels, then process the data as needed:

print (df.loc[df.index.intersection(i)])
   x   y  z
0  1   7  a
3  4  10  b
5  6  12  b

df = df.loc[df.index.intersection(i)].groupby('z').agg({'y':'sum'})
# alternative using isin:
#df = df.loc[df.index.isin(i)].groupby('z').agg({'y':'sum'})
print (df)
    y
z    
a   7
b  22

EDIT:

df1 = df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[x.index.intersection(i)])})
# alternative using isin:
#df1 = df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[x.index.isin(i)])})
print (df1)
    x   y
z        
a   6   7
b  15  22
jezrael