Consider this example:

import pandas as pd
import numpy as np
foo = pd.DataFrame(dict(letter=['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'],
                 number=[1,1,2,2,3,np.nan, np.nan,4]))
grouped = foo.groupby(foo.number)
print(grouped['letter'].transform(lambda x: sum(x=='a')))

Out[18]: 
0    2
1    2
2    1
3    1
4    0
5    b
6    a
7    0

Instead of showing 1 on rows 5 and 6, the original letters 'b' and 'a' are shown, presumably because those rows have np.nan as their group key. Is there any way to stop this from happening without replacing the NaN values with some dummy value? Also, why does this happen?

economy
Hillary Sanders
  • Unfortunately, it looks like groups grouped by `nan` are excluded (see `print grouped.groups`). Also see this question: https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values – wflynny Dec 02 '15 at 22:50

1 Answer

The pandas docs explain this here: http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Groups whose key is NaN are excluded from the groupby; this is consistent with R.

Earlier versions of pandas did include them, but that behavior has since been removed.
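
For readers on a newer pandas, here is a minimal sketch of one way to keep the NaN rows as their own group. It assumes pandas 1.1 or later, where `groupby` accepts a `dropna` argument; that argument is not mentioned in the original question or answer. The names `is_a` and `counts` are made up for illustration.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(dict(letter=['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'],
                        number=[1, 1, 2, 2, 3, np.nan, np.nan, 4]))

# By default the NaN key is dropped, so only the keys 1, 2, 3 and 4 form groups.
default_groups = foo.groupby('number').ngroups

# Assumption: pandas >= 1.1. dropna=False keeps NaN as a group key of its own.
with_nan_groups = foo.groupby('number', dropna=False).ngroups

# Count the 'a' letters per group and broadcast the count back to every row.
is_a = foo['letter'].eq('a')
counts = is_a.groupby(foo['number'], dropna=False).transform('sum')

print(default_groups, with_nan_groups)
print(counts.tolist())
```

With `dropna=False`, rows 5 and 6 share one NaN group, so both should get the count the question expected without substituting a dummy value for NaN.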

ctrl-alt-delete