8

consider the pd.Series s

a = np.arange(4)
mux = pd.MultiIndex.from_product([list('ab'), list('xy')])
s = pd.Series([a] * 4, mux)
print(s)

a  x    [0, 1, 2, 3]
   y    [0, 1, 2, 3]
b  x    [0, 1, 2, 3]
   y    [0, 1, 2, 3]
dtype: object

problem
each element of s is a numpy.array. when I try to sum within groups, I get an error because the groupby function expects the result to be scalar... (I'm guessing)

s.groupby(level=0).sum()
Exception                                 Traceback (most recent call last)
<ipython-input-627-c5b3bf6890ea> in <module>()
----> 1 s.groupby(level=0).sum()

C:\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in f(self)
    101             raise SpecificationError(str(e))
    102         except Exception:
--> 103             result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
    104             if _convert:
    105                 result = result._convert(datetime=True)

C:\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
   2584                 return self._python_agg_general(func_or_funcs, *args, **kwargs)
   2585             except Exception:
-> 2586                 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
   2587 
   2588             index = Index(sorted(result), name=self.grouper.names[0])

C:\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in _aggregate_named(self, func, *args, **kwargs)
   2704             output = func(group, *args, **kwargs)
   2705             if isinstance(output, (Series, Index, np.ndarray)):
-> 2706                 raise Exception('Must produce aggregated value')
   2707             result[name] = self._try_cast(output, group)
   2708 

Exception: Must produce aggregated value

work around
when I use apply with np.sum, it works fine.

s.groupby(level=0).apply(np.sum)

a    [0, 2, 4, 6]
b    [0, 2, 4, 6]
dtype: object

question
is there an elegant way to handle this?


real problem
I actually want to use agg in this way

s.groupby(level=0).agg(['sum', 'prod'])

but it fails in the same way.
the only way to get this is to

pd.concat([g.apply(np.sum), g.apply(np.prod)],
          axis=1, keys=['sum', 'prod'])

enter image description here

but this doesn't generalize well to longer lists of transforms.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I think your work around is pretty elegant! – Steven G Oct 03 '16 at 23:47
  • @StevenG my apologies, I deleted this question before because it was ill formed. I had to run and now I'm trying to sharpen the question. My real issue is that I can't aggregate without using my workaround and `pd.concat` – piRSquared Oct 03 '16 at 23:58
  • 1
    I can't find a good link at the moment, but yeah, nonscalar elements aren't really supported. – DSM Oct 04 '16 at 00:01
  • 1
    from the notes : Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)). [see this at the bottom](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) – Steven G Oct 04 '16 at 00:13
  • [look at this answer](http://stackoverflow.com/questions/16975318/pandas-aggregate-when-column-contains-numpy-arrays) – Steven G Oct 04 '16 at 00:26

2 Answers2

7

from this well explained answer you could transform your ndarray to list because pandas seems to be checking if the output is a ndarray and this is why you are getting this error raised :

s.groupby(level=0).agg({"sum": lambda x: list(x.sum()), "prod":lambda x: list(x.prod())})

Out[249]:

            sum          prod
a  [0, 2, 4, 6]  [0, 1, 4, 9]
b  [0, 2, 4, 6]  [0, 1, 4, 9]
Community
  • 1
  • 1
Steven G
  • 16,244
  • 8
  • 53
  • 77
1

Pandas is not designed to have arrays as values. It's much better practice to use a DataFrame for s instead of a Series. This will give you expected behavior and be much faster than using lambdas/lists.

You can easily convert into a DataFrame with:

s = s.apply(pd.Series)

At that point aggregating on any level is really easy.

s.groupby(level=0).agg(['sum', 'prod'])

    0        1        2        3     
  sum prod sum prod sum prod sum prod
a   0    0   2    1   4    4   6    9
b   0    0   2    1   4    4   6    9

You can stop there, but I don't think this is the format you're ideally wanting. It's easy enough to re-stack the aggregation.

test = s.groupby(level=0).agg(['sum', 'prod'])
test = test.stack(level=0).unstack()
test

  prod          sum         
     0  1  2  3   0  1  2  3
a    0  1  4  9   0  2  4  6
b    0  1  4  9   0  2  4  6

And at that point you can call each of product and sum like you'd expect.

test['prod']

   0  1  2  3
a  0  1  4  9
b  0  1  4  9

Or if you want it back as an array:

test['prod'].values

array([[0, 1, 4, 9],
       [0, 1, 4, 9]])