I understand this is no longer supported, but, especially coming from a SQL background, I find it incredibly, incredibly, incredibly (did I mention incredibly?) absurd. Almost all of the data I deal with contains NaNs/nulls, and it is extremely important for me to know when that is the case and how many nulls there are.
Is there a way to get around this? The only workaround I can think of is:
- replace the nulls with some other value
- run the aggregation functions I want and save the results to a separate dataframe
- reset the index (it's not clear to me whether an index can contain nulls)
- change 'some other value' back to NaN
However, this poses a number of problems: not only is it cumbersome, but I also need to replace the NaNs with a value that is not already present in the data (a rough sketch of this workaround is below). Does anyone have a better way to do this?
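For concreteness, something along these lines is what I have in mind; the sentinel value -999 and the toy data here are just placeholders, and in practice I would have to pick a value I know never appears in the real column:

import pandas as pd
import numpy as np

SENTINEL = -999  # assumed never to appear in 'id'
df = pd.DataFrame({'id': [1, 1, np.nan, 2, np.nan], 'x': [10, 20, 30, 40, 50]})

filled = df.fillna({'id': SENTINEL})             # 1. replace nulls with a sentinel
agg = filled.groupby('id')['x'].sum()            # 2. run the aggregation
agg = agg.reset_index()                          # 3. reset the index so 'id' is a column again
agg.loc[agg['id'] == SENTINEL, 'id'] = np.nan    # 4. turn the sentinel back into NaN
print(agg)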
Yes, I know similar questions have been asked before (e.g. groupby columns with NaN (missing) values), but years ago, so I was wondering whether new solutions have come up in the meantime.
Also, the docs mention that this behaviour is consistent with R, but that's not entirely true: R has an option NOT to omit NaNs, whereas pandas does not.
I have put together a simple example of what I mean:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['id'] = np.repeat([1.0, 2.0], 3)  # float so the column can hold NaN
df['x'] = 1
df.loc[0, 'id'] = np.nan
print(df)

gr = df.groupby('id')
print(gr.count())  # the NaN group is silently dropped

df['id'] = df['id'].fillna(999)
newgr = df.groupby('id')
# changing the groups dict doesn't seem to have any effect:
newgr.groups[np.nan] = newgr.groups.pop(999)
print('After the change:')
print(newgr.groups)
print(newgr.count())
print('id still shows up as 999')

print('Instead, I can change the final output:')
mycount = newgr.count().reset_index()
mycount.loc[mycount['id'] == 999, 'id'] = np.nan
print(mycount)