I want to use the transform method on a groupby object with built-in (i.e. 'mean', 'sum', etc.) functions, but keep np.nan values. For example,

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'value': np.random.randint(0, 100, 8)}, index=list('aabbccdd'))
df.iloc[[0, 6]] = np.nan

df.groupby(level=0).transform('min')

yields

   value
a   43.0
a   43.0
b    4.0
b    4.0
c   44.0
c   44.0
d   89.0
d   89.0

but I want:

   value
a    NaN
a    NaN
b    4.0
b    4.0
c   44.0
c   44.0
d    NaN
d    NaN

Using my own function, such as lambda x: x.min(skipna=False), will work... eventually, but I have millions of small groups, on which lambda and NumPy methods take an eternity. Any suggestions?

Yes, there is a similar question, but note that in that question the OP wants to include np.nan groups, whereas I want to not skip over np.nan values within the groups.
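
To spell out the difference (a small made-up frame, not taken from the linked question):

import numpy as np
import pandas as pd

# the linked question: NaN in the grouping *keys*, so the whole NaN-keyed group is dropped
keyed = pd.DataFrame({'key': ['a', 'a', np.nan], 'value': [1, 2, 3]})
print(keyed.groupby('key')['value'].sum())  # only group 'a' appears

# my case: the keys are all valid; the NaN sits in the *values*,
# and the built-in 'min' silently skips it within its group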

  • "NA groups in GroupBy are automatically excluded. This behavior is consistent with R" http://pandas.pydata.org/pandas-docs/stable/missing_data.html#na-values-in-groupby – xyzjayne Jul 13 '18 at 21:26
  • Possible duplicate of [groupby columns with NaN (missing) values](https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values) – xyzjayne Jul 13 '18 at 21:26
  • @xyzjayne please read the question carefully. Your duplicate suggestion is about including `np.nan` groups; my question is about not skipping over `np.nan` values in well-defined groups – Gene Burinsky Jul 13 '18 at 21:32

2 Answers


Pandas aggregations skip NaN by default, so you will have to drop down to NumPy, whose np.min propagates NaN:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({'value':np.random.randint(0,100,8)}, index=list('aabbccdd'))
df.iloc[[0,6]] = np.nan

# np.min propagates NaN, so any group containing NaN yields NaN
res = df.groupby(level=0).transform(lambda x: np.min(x.values))

print(res)

   value
a    NaN
a    NaN
b   22.0
b   22.0
c   46.0
c   46.0
d    NaN
d    NaN

Alternatively, use 'min' as normal, and then update where you find null values:

res = df.groupby(level=0).transform('min')
# blank out every group that contains at least one null value
res[df.groupby(level=0)['value'].transform(lambda x: x.isnull().any()).astype('bool')] = np.nan
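
If that lambda is still a bottleneck, here is a sketch of the same idea using only built-in aggregations (nan_in_group is just an illustrative name):

# count the nulls per group with a built-in aggregation, then blank out
# every row belonging to a group that has at least one
nan_in_group = df['value'].isna().groupby(level=0).transform('sum') > 0
res = df.groupby(level=0).transform('min')
res[nan_in_group] = np.nan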
  • thanks! Alas, it has been 10 min since your answer was posted and the solution code is still running on my real data. I guess I'll have to find a workaround. – Gene Burinsky Jul 13 '18 at 21:40
  • Updated with another solution. Not sure it'll help though if the first solution isn't performant. – jpp Jul 13 '18 at 21:45
  • two things: 1) on my version of `Pandas`, `df.groupby(level=0)['value'].transform(lambda x: x.isnull().any())` does not return boolean values (booleans are returned as integers), so the indexing you proposed does not work. 2) The recourse to a `lambda` function is no faster than your initial solution. Must say, `transform` works in strange ways – Gene Burinsky Jul 13 '18 at 22:15
  • You can always add `.astype(bool)` to convert integers to Boolean. `transform` confuses me too sometimes. – jpp Jul 13 '18 at 22:16

Because any recourse to NumPy functions inside transform slows the code down painfully, my workaround is simply to create an additional column that marks nulls, use a built-in transform, and apply a boolean mask outside of transform to get around the problem:

# mark nulls
df.loc[:, 'missValue'] = df.value.isnull()

# use the built-in min as usual
res = df.groupby(level=0).value.transform('min')

# set a group's results to NaN if one of its values is missing
# (with two rows per group, the mean of the boolean marker is .5 exactly when one value is null)
res[(df.groupby(level=0).missValue.transform('mean') == .5)] = np.nan

The above runs in less than a minute, while the code proposed by jpp runs over 10 minutes for one iteration on a subset of my actual data set.
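
Note that the == .5 test relies on each group having exactly two rows. Continuing from the code above, a sketch of a more general check that flags any group containing a missing value, regardless of group size:

# general version: a group's result becomes NaN if it contains any missing value
# ('sum' of the boolean marker counts the nulls per group)
res[df.groupby(level=0).missValue.transform('sum') > 0] = np.nan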
