
I understand this is no longer supported, but, especially coming from a SQL background, I find it incredibly, incredibly, incredibly (did I mention incredibly?) absurd. Almost all of the data I deal with has NaNs/nulls, and it is extremely important to me to know when that is the case and how many nulls there are in my data.

Is there a way to get round this? The only solution I can think of is:

  1. replace nulls with some other value
  2. save the results of the aggregated functions I want to run to a separate dataframe
  3. reset the index (not clear to me if an index can be null)
  4. change 'some other value' back to NaN

However, this poses a number of problems: not only is it cumbersome, but I also need to replace NaNs with a value that is not already present in the data. Does anyone have a better way to do this?
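To make the workaround concrete, here is a rough sketch of what I have in mind, using -1 as the placeholder on the assumption that it never appears in the real 'id' column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [np.nan, 1, 1, 2, 2, 2], 'x': 1})

SENTINEL = -1  # assumed not to occur anywhere in the real data

# 1. replace nulls with some other value
tmp = df.copy()
tmp['id'] = tmp['id'].fillna(SENTINEL)

# 2. save the results of the aggregation to a separate dataframe
counts = tmp.groupby('id').count()

# 3. reset the index so 'id' becomes a normal column again
counts = counts.reset_index()

# 4. change 'some other value' back to NaN
counts.loc[counts['id'] == SENTINEL, 'id'] = np.nan

print(counts)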

Yes, I know similar questions have been asked (groupby columns with NaN (missing) values), but that was years ago, so I was wondering whether new solutions have come up in the meantime.

Also, the docs mention that this behaviour is consistent with R, but that's not entirely true, because R has an option NOT to omit NaNs, whereas pandas does not.

I have put together a simple example of what I mean:

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['id'] = np.repeat([1, 2], 3)
df['x'] = 1
df.loc[0, 'id'] = np.nan

print(df)

gr = df.groupby('id')

# the row whose id is NaN is silently dropped from the result
print(gr.count())

df['id'] = df['id'].fillna(999)

newgr = df.groupby('id')

# changing groups doesn't seem to have any effect:
newgr.groups[np.nan] = newgr.groups.pop(999)

print('After the change:')
print(newgr.groups)
print(newgr.count())
print('id still shows up as 999')

print('Instead, I can change the final output:')

mycount = newgr.count().reset_index()
mycount.loc[mycount['id'] == 999, 'id'] = np.nan
print(mycount)