Ensuring groupby output type

Question

Consider this example:

import pandas as pd
import numpy as np
foo = pd.DataFrame(dict(letter=['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'],
                 number=[1,1,2,2,3,np.nan, np.nan,4]))
grouped = foo.groupby(foo.number)
print(grouped['letter'].transform(lambda x: sum(x == 'a')))

Out[18]: 
0    2
1    2
2    1
3    1
4    0
5    b
6    a
7    0

Instead of showing 1 on rows 5 and 6, 'b' and 'a' are shown, presumably because those rows were grouped on an np.nan key. Is there any way to stop this from happening without replacing the nan values with some dummy value? Also, why does this happen?

Unfortunately, it looks like groups grouped by `nan` are excluded (see `print grouped.groups`). Also see this question: https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values — wflynny, Dec 02 '15 at 22:50

score 1 · Accepted Answer · answered Dec 02 '15 at 22:59

The pandas docs explain this here: http://pandas.pydata.org/pandas-docs/stable/missing_data.html

NaN values in the grouping key are excluded from the groups; this is consistent with R.

Earlier versions of pandas did include NaN groups, but that behavior has since been removed.
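For readers on a newer pandas, here is a minimal sketch of one way to keep the NaN rows in their own group without a dummy value. It assumes pandas 1.1 or later, where `groupby` accepts a `dropna` parameter; this is not something the original answer covered, and the variable names (`default_groups`, `with_nan`, `is_a`, `counts`) are only illustrative.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({
    "letter": ["a", "a", "a", "b", "b", "b", "a", "b"],
    "number": [1, 1, 2, 2, 3, np.nan, np.nan, 4],
})

# Default behaviour: rows whose key is NaN are not placed in any group.
default_groups = foo.groupby("number").size()

# Assumption: pandas >= 1.1. dropna=False keeps NaN as its own group key.
with_nan = foo.groupby("number", dropna=False).size()

# Count the 'a' letters per group and broadcast the count back to every row,
# including the rows whose grouping key is NaN.
is_a = foo["letter"].eq("a")
counts = is_a.groupby(foo["number"], dropna=False).transform("sum")

print(default_groups)
print(with_nan)
print(counts)
```

Comparing the index of `default_groups` with that of `with_nan` shows whether the NaN key was kept. Building the boolean mask first and passing the string "sum" to `transform` avoids the Python-level lambda from the question, but that is a style choice rather than a requirement.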

ctrl-alt-delete

Thanks. So, not possible without dummy variables. – Hillary Sanders Dec 02 '15 at 23:26
