
I see different behavior when applying the same numpy function as a groupby aggregation function versus applying it directly to the same list of values, when NaN values are involved.

This applies to np.sum, np.min, np.max and np.mean. As an aggregation function, the behavior looks the same as if np.nansum, np.nanmin, etc. were used.

For example

import pandas as pd
import numpy as np
xx = pd.DataFrame([['A', 1.,  2.,      3.],
                   ['A', 3.,  np.nan,  4.],
                   ['B', 5.,  6.,      np.nan],
                   ['B', 7.,  8.,      9.]])

xx.groupby(0).agg(np.sum)

Gives

       1     2     3
0           
A    4.0   2.0   7.0
B   12.0  14.0   9.0

But np.array([np.nan, 9.]).sum(), np.sum(np.array([np.nan, 9.])) and np.sum([np.nan, 9.]) all output nan.

I would have expected the aggregation function to produce nan as well; the output I got is what np.nansum would have produced.
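The numpy side of the comparison can be checked directly; np.nansum is the variant whose per-group result matches the groupby output:

```python
import numpy as np

vals = np.array([np.nan, 9.])
print(np.sum(vals))     # NaN propagates through a plain numpy sum
print(np.nansum(vals))  # np.nansum ignores the NaN instead
```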

pandas 0.24.2, numpy 1.16.2

Marcello
It seems that rows containing nans are automatically dropped by groupby. Check out this answer: https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values – dzang Apr 24 '19 at 08:51
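The comment's point is distinct from NaN in the aggregated values: a NaN in the grouping column removes that row from every group, since groupby drops missing keys by default. A minimal sketch (`key`/`val` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', np.nan, 'B'],
                   'val': [1., 2., 3.]})

# The row whose key is NaN does not appear in any group
print(df.groupby('key')['val'].sum())
```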

1 Answer

The difference comes from pandas' behaviour, not from numpy.sum(): np.nan values are automatically excluded by pandas.groupby aggregations.

import pandas as pd
import numpy as np
xx = pd.DataFrame([['A', np.nan],
                   ['A', 4.],
                   ['B', 1],
                   ['B', 2]])

xx.groupby(0).count()

OUTPUT

   1
0   
A  1
B  2
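One way to restore NaN propagation, if that is what's wanted: pandas' own Series.sum takes a skipna parameter, so passing a lambda to agg (instead of the bare numpy function, which pandas replaces with its NaN-skipping sum) keeps the NaN. A sketch using the asker's frame:

```python
import numpy as np
import pandas as pd

xx = pd.DataFrame([['A', 1.,  2.,      3.],
                   ['A', 3.,  np.nan,  4.],
                   ['B', 5.,  6.,      np.nan],
                   ['B', 7.,  8.,      9.]])

# Default: NaN is skipped per group, matching np.nansum
skipped = xx.groupby(0).agg(np.sum)

# Calling Series.sum with skipna=False propagates the NaN
propagated = xx.groupby(0).agg(lambda s: s.sum(skipna=False))
print(propagated)
```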
naivepredictor