I want to use the transform method on a groupby object with built-in (i.e. 'mean', 'sum', etc.) functions, but keep np.nan values. For example,

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'value': np.random.randint(0, 100, 8)}, index=list('aabbccdd'))
df.iloc[[0, 6]] = np.nan

df.groupby(level=0).transform('min')

yields

   value
a   43.0
a   43.0
b    4.0
b    4.0
c   44.0
c   44.0
d   89.0
d   89.0

but I want:

   value
a    NaN
a    NaN
b    4.0
b    4.0
c   44.0
c   44.0
d    NaN
d    NaN

Using my own function, such as lambda x: x.min(skipna=False), will work... eventually, but I have millions of small groups, on which lambda and NumPy methods take an eternity. Any suggestions?

Yes, there is a similar question, but note that in that question the OP wants to include np.nan groups, whereas I want to not skip over np.nan values within the groups.
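
To spell out the difference (a small made-up frame, not taken from the linked question):

import numpy as np
import pandas as pd

# the linked question: NaN in the grouping *keys*, so the whole NaN-keyed group is dropped
keyed = pd.DataFrame({'key': ['a', 'a', np.nan], 'value': [1, 2, 3]})
print(keyed.groupby('key')['value'].sum())  # only group 'a' appears

# my case: the keys are all valid; the NaN sits in the *values*,
# and the built-in 'min' silently skips it within its group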

  • "NA groups in GroupBy are automatically excluded. This behavior is consistent with R" http://pandas.pydata.org/pandas-docs/stable/missing_data.html#na-values-in-groupby – xyzjayne Jul 13 '18 at 21:26
  • Possible duplicate of [groupby columns with NaN (missing) values](https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values) – xyzjayne Jul 13 '18 at 21:26
  • @xyzjayne please read the question carefully. Your duplicate suggestion is about including `np.nan` groups; my question is about not skipping over `np.nan` values in well-defined groups – Gene Burinsky Jul 13 '18 at 21:32

2 Answers


Pandas aggregations skip NaN by default, so you will have to drop down to NumPy, whose np.min propagates NaN:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({'value':np.random.randint(0,100,8)}, index=list('aabbccdd'))
df.iloc[[0,6]] = np.nan

# np.min propagates NaN, so any group containing NaN yields NaN
res = df.groupby(level=0).transform(lambda x: np.min(x.values))

print(res)

   value
a    NaN
a    NaN
b   22.0
b   22.0
c   46.0
c   46.0
d    NaN
d    NaN

Alternatively, use 'min' as normal, and then update where you find null values:

res = df.groupby(level=0).transform('min')
# blank out every group that contains at least one null value
res[df.groupby(level=0)['value'].transform(lambda x: x.isnull().any()).astype('bool')] = np.nan
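
If that lambda is still a bottleneck, here is a sketch of the same idea using only built-in aggregations (nan_in_group is just an illustrative name):

# count the nulls per group with a built-in aggregation, then blank out
# every row belonging to a group that has at least one
nan_in_group = df['value'].isna().groupby(level=0).transform('sum') > 0
res = df.groupby(level=0).transform('min')
res[nan_in_group] = np.nan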
  • thanks! Alas, it has been 10 min since your answer was posted and the solution code is still running on my real data. I guess I'll have to find a workaround. – Gene Burinsky Jul 13 '18 at 21:40
  • Updated with another solution. Not sure it'll help though if the first solution isn't performant. – jpp Jul 13 '18 at 21:45
  • two things: 1) on my version of `Pandas`, `df.groupby(level=0)['value'].transform(lambda x: x.isnull().any())` does not return boolean values (booleans are returned as integers), so the indexing you proposed does not work. 2) The recourse to a `lambda` function is no faster than your initial solution. Must say, `transform` works in strange ways – Gene Burinsky Jul 13 '18 at 22:15
  • You can always add `.astype(bool)` to convert integers to Boolean. `transform` confuses me too sometimes. – jpp Jul 13 '18 at 22:16

Because any recourse to NumPy functions inside transform slows the code down painfully, my workaround is simply to create an additional column that marks nulls, use a built-in transform, and apply a boolean mask outside of transform to get around the problem:

# mark nulls
df.loc[:, 'missValue'] = df.value.isnull()

# use the built-in min as usual
res = df.groupby(level=0).value.transform('min')

# set a group's results to NaN if one of its values is missing
# (with two rows per group, the mean of the boolean marker is .5 exactly when one value is null)
res[(df.groupby(level=0).missValue.transform('mean') == .5)] = np.nan

The above runs in less than a minute, while the code proposed by jpp runs over 10 minutes for one iteration on a subset of my actual data set.
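
Note that the == .5 test relies on each group having exactly two rows. Continuing from the code above, a sketch of a more general check that flags any group containing a missing value, regardless of group size:

# general version: a group's result becomes NaN if it contains any missing value
# ('sum' of the boolean marker counts the nulls per group)
res[df.groupby(level=0).missValue.transform('sum') > 0] = np.nan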
