
This dataset has 2 million patient records. I've been asked to make every variable dichotomous, and that part is done, but a patient can have multiple records, so I have to group them by patient. When I do this I lose data; any idea why? It doesn't happen in every field:

[image: sample dataframe]

I am adding an image of a sample dataframe on which I perform groupby('npaciente'), and then you can see that value_counts is not returning anything for the complicacionescronicas column:

[image: value_counts output with an empty complicacionescronicas column]
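
A minimal sketch of what I'm doing (the column names are taken from the screenshots; the sample values here are made up to reproduce the symptom):

import numpy as np
import pandas as pd

# made-up sample shaped like my data: dichotomous (0/1) columns with NaNs,
# and some records with no patient id
df = pd.DataFrame({'npaciente': [np.nan, np.nan, 2.0, 3.0],
                   'complicacionescronicas': [1.0, 0.0, np.nan, np.nan]})

print (df.complicacionescronicas.value_counts())   # has counts before grouping

g = df.groupby('npaciente').max()
print (g.complicacionescronicas.value_counts())    # empty after grouping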

alexzaizar09
  • What does `print(g.head())` return? – jezrael Aug 02 '17 at 07:34
  • Are you aware that `value_counts()` by default does not report `NaN`s? – DYZ Aug 02 '17 at 07:34
  • `print(g.head())` is super large, 169 cols, but for the 2 that I show, the top 5 rows are all NaN. And yes, I am aware that value_counts doesn't report NaNs by default, but as you can see in the image the columns are not completely full of NaN, and I'm losing that data upon aggregation. Any ideas? – alexzaizar09 Aug 02 '17 at 07:42

2 Answers


I think the problem is that after aggregating with max you get all NaNs, so value_counts returns an empty Series:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,1,0,np.nan],
                   'npatience':[np.nan,np.nan,4,5],
                   'C':[1,0,np.nan,np.nan],
                   'D':[1,3,5,7]})

print (df)
     A    C  D  npatience
0  1.0  1.0  1        NaN
1  1.0  0.0  3        NaN
2  0.0  NaN  5        4.0
3  NaN  NaN  7        5.0

print (df.A.value_counts())
1.0    2
0.0    1
Name: A, dtype: int64

print (df.C.value_counts())
0.0    1
1.0    1
Name: C, dtype: int64

g = df.groupby('npatience').max()
print (g)
             A   C  D
npatience            
4.0        0.0 NaN  5
5.0        NaN NaN  7

print (g.C)
npatience
4.0   NaN
5.0   NaN
Name: C, dtype: float64

#check whether all values in the column are NaN
print (g.C.isnull().all())
True

print (g.A)
npatience
4.0    0.0
5.0    NaN
Name: A, dtype: float64


print (g.C.value_counts())
Series([], Name: C, dtype: int64)

print (g.A.value_counts())
0.0    1
Name: A, dtype: int64

print (g.C.value_counts(dropna=False))
NaN    2
Name: C, dtype: int64

print (g.A.value_counts(dropna=False))
NaN     1
 0.0    1
Name: A, dtype: int64
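
As a quick check, you can list every column that ends up entirely NaN after the aggregation (a small sketch using `g` from above):

#columns that became all NaN after groupby + max
print (g.columns[g.isnull().all()])
Index(['C'], dtype='object')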

EDIT:

groupby by default removes rows where the key is NaN (it cannot group by NaN), so it is the same as calling dropna before groupby:

g = df.dropna(subset=['npatience'])
print (g)
     A   C  D  npatience
2  0.0 NaN  5        4.0
3  NaN NaN  7        5.0

print (g.C)
2   NaN
3   NaN
Name: C, dtype: float64

#check whether all values in the column are NaN
print (g.C.isnull().all())
True

And the solution for groupby without removing NaNs is to replace the NaNs with a value that does not appear in df, like 1000:

g = df.fillna(1000).groupby('npatience').max()
print (g)
                A       C  D
npatience                   
4.0           0.0  1000.0  5
5.0        1000.0  1000.0  7
1000.0        1.0     1.0  3

print (g.C.value_counts())
1000.0    2
1.0       1
Name: C, dtype: int64
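
Note: newer pandas versions (1.1+) let you keep the NaN group directly with `dropna=False` in `groupby`, without the fillna trick (a sketch with the same df):

g = df.groupby('npatience', dropna=False).max()
print (g)
             A    C  D
npatience             
4.0        0.0  NaN  5
5.0        NaN  NaN  7
NaN        1.0  1.0  3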
jezrael
  • Yes, after the aggregate function the complicacionescronicas series is all empty, but I don't know why, since before the aggregation it has some values; that is what I'm trying to fix – alexzaizar09 Aug 02 '17 at 08:47
  • Yes, it is a problem with the data - for the rows where `npatience` is NaN there is some data, and the other rows are NaN. So when you groupby, the NaN rows are removed and you get all `NaN`s. groupby effectively does `df = df.dropna(subset=['npatience']).groupby('npatience').max()`, so you lose data. – jezrael Aug 02 '17 at 08:52
  • And I have a solution, give me some time :) – jezrael Aug 02 '17 at 09:01

Of course you are losing data when you do an aggregation (groupby + max) - that's normal.

Demo:

In [3]: import numpy as np

In [4]: import pandas as pd

In [5]: df = pd.DataFrame(np.random.randint(0,5,(5,3)), columns=list('abc'))

In [6]: df
Out[6]:
   a  b  c
0  4  1  4
1  4  3  4
2  1  1  0
3  3  3  0
4  4  0  2

In [7]: df.b.value_counts()
Out[7]:
3    2
1    2
0    1
Name: b, dtype: int64

In [8]: df.c.value_counts()
Out[8]:
4    2
0    2
2    1
Name: c, dtype: int64

after aggregation:

In [9]: g = df.groupby('a').max()

In [10]: g
Out[10]:
   b  c
a
1  1  0
3  3  0
4  3  4

In [11]: g.b.value_counts()
Out[11]:
3    2
1    1
Name: b, dtype: int64

In [12]: g.c.value_counts()
Out[12]:
0    2
4    1
Name: c, dtype: int64
MaxU - stand with Ukraine