
I see different behavior when applying the same numpy function as a groupby aggregation function versus applying it directly to the same list of values, when NaN values are involved.

This applies to np.sum, np.min, np.max and np.mean. As an aggregation function, the behavior looks the same as if np.nansum, np.nanmin, etc. were used.

For example

import pandas as pd
import numpy as np
xx = pd.DataFrame([['A', 1.,  2.,      3.],
                   ['A', 3.,  np.nan,  4.],
                   ['B', 5.,  6.,      np.nan],
                   ['B', 7.,  8.,      9.]])

xx.groupby(0).agg(np.sum)

Gives

       1     2     3
0           
A    4.0   2.0   7.0
B   12.0  14.0   9.0

But np.array([np.nan, 9.]).sum(), np.sum(np.array([np.nan, 9.])) and np.sum([np.nan, 9.]) all output nan.

I would have expected the aggregation function to produce nan as well; the output I got is what np.nansum would have produced.
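The numpy side of the comparison can be checked directly; np.nansum is the variant whose per-group result matches the groupby output:

```python
import numpy as np

vals = np.array([np.nan, 9.])
print(np.sum(vals))     # NaN propagates through a plain numpy sum
print(np.nansum(vals))  # np.nansum ignores the NaN instead
```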

pandas 0.24.2, numpy 1.16.2

Marcello
It seems that rows containing nans are automatically dropped by groupby. Check out this answer: https://stackoverflow.com/questions/18429491/groupby-columns-with-nan-missing-values – dzang Apr 24 '19 at 08:51
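The comment's point is distinct from NaN in the aggregated values: a NaN in the grouping column removes that row from every group, since groupby drops missing keys by default. A minimal sketch (`key`/`val` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', np.nan, 'B'],
                   'val': [1., 2., 3.]})

# The row whose key is NaN does not appear in any group
print(df.groupby('key')['val'].sum())
```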

1 Answer

The difference comes from pandas' behaviour, not from numpy.sum(): np.nan values are automatically excluded by pandas.groupby aggregations.

import pandas as pd
import numpy as np
xx = pd.DataFrame([['A', np.nan],
                   ['A', 4.],
                   ['B', 1],
                   ['B', 2]])

xx.groupby(0).count()

OUTPUT

   1
0   
A  1
B  2
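One way to restore NaN propagation, if that is what's wanted: pandas' own Series.sum takes a skipna parameter, so passing a lambda to agg (instead of the bare numpy function, which pandas replaces with its NaN-skipping sum) keeps the NaN. A sketch using the asker's frame:

```python
import numpy as np
import pandas as pd

xx = pd.DataFrame([['A', 1.,  2.,      3.],
                   ['A', 3.,  np.nan,  4.],
                   ['B', 5.,  6.,      np.nan],
                   ['B', 7.,  8.,      9.]])

# Default: NaN is skipped per group, matching np.nansum
skipped = xx.groupby(0).agg(np.sum)

# Calling Series.sum with skipna=False propagates the NaN
propagated = xx.groupby(0).agg(lambda s: s.sum(skipna=False))
print(propagated)
```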
naivepredictor