Consider this example:
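import pandas as pd
import numpy as np

foo = pd.DataFrame(dict(letter=['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'],
                        number=[1, 1, 2, 2, 3, np.nan, np.nan, 4]))
# Group by the numeric column, which contains two NaN keys (rows 5 and 6)
grouped = foo.groupby(foo.number)
# For each row, count how many 'a' letters its group contains
print(grouped['letter'].transform(lambda x: sum(x == 'a')))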
Out[18]:
0 2
1 2
2 1
3 1
4 0
5 b
6 a
7 0
Instead of showing 1 on rows 5 and 6, 'a' and 'b' are shown, presumably because the groupby was indexed on a np.nan value. Is there any way to stop this from happening, without replacing nan values with some dummy variable? Also, why does this happen?
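For reference, here is a minimal sketch of the behaviour I am after. It assumes pandas 1.1 or later, where groupby accepts a dropna argument; by default (dropna=True) rows whose key is NaN are left out of every group. The variable name counts is only for illustration, and I have not checked this against older pandas versions.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({
    "letter": ["a", "a", "a", "b", "b", "b", "a", "b"],
    "number": [1, 1, 2, 2, 3, np.nan, np.nan, 4],
})

# Assumption: pandas >= 1.1. dropna=False keeps the rows whose key is NaN
# and treats NaN as a group of its own instead of discarding those rows.
grouped = foo.groupby("number", dropna=False)

# For each row, count how many 'a' letters appear in that row's group.
counts = grouped["letter"].transform(lambda x: (x == "a").sum())
print(counts)
```

With NaN treated as its own group, rows 5 and 6 share a group containing one 'a', so both should come out as 1 without any dummy value being substituted.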