
I am using an aggregation function that I have used in my work for a long time now. The idea is that if the Series passed to the function is of length 1 (i.e. the group only has one observation), then that observation is returned. If the length of the Series passed is greater than one, then the observations are returned in a list.

This may seem odd to some, but this is not an XY problem: I have good reason for wanting to do this, and it is not relevant to this question.

This is the function that I have been using:

def MakeList(x):
    """ This function is used to aggregate data that needs to be kept distinct within multi-day 
        observations for later use and transformation. It makes a list of the data and if the list is of length 1
        then there is only one line/day observation in that group so the single element of the list is returned. 
        If the list is longer than one then there are multiple line/day observations and the list itself is 
        returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]
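
For illustration, here is a quick check of the two return shapes on standalone Series, with no groupby involved. The function body is repeated only so the snippet runs on its own; the sample values are taken from the test data in this question.

```python
import pandas as pd

def MakeList(x):
    # Same logic as the function above, repeated so this snippet is self-contained.
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]

# A group with a single observation comes back as a bare value.
single = MakeList(pd.Series([9.550]))

# A group with several observations comes back as a list.
multiple = MakeList(pd.Series([7.760, 25.564]))

print(single)
print(multiple)
```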

Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:

import pandas as pd
DF = pd.DataFrame({'date': ['2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02'],
                    'line_code':   ['401101',
                                    '401101',
                                    '401102',
                                    '401103',
                                    '401104',
                                    '401105',
                                    '401105',
                                    '401106',
                                    '401106',
                                    '401107'],
                    's.m.v.': [ 7.760,
                                25.564,
                                25.564,
                                9.550,
                                4.870,
                                7.760,
                                25.564,
                                5.282,
                                25.564,
                                5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

In trying to debug this, I put a print statement to the effect of print L and print x.index and the output was as follows:

[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')

For some reason it appears that agg is passing the Series twice to the function. This as far as I know is not normal at all, and is presumably the reason why my function is not reducing.

For example if I write a function like this:

def test_func(x):
    print(x.index)
    return x.iloc[0]

This runs without problem and the print statements are:

DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})

Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')

Which indicates that each group is only being passed once as a Series to the function.

Can anyone help me understand why this is failing? I have used this function successfully on many of the data sets I work with.

Thanks

Woody Pride
  • It is possible that pandas gets confused if your function sometimes returns a list and sometimes a single value, since different dtypes would be used for those two cases. It is probably better not to do it that way. The calling-twice behavior could be related to the issue described [here](http://stackoverflow.com/questions/21390035/python-pandas-groupby-object-apply-method-duplicates-first-group) for `apply`: it calls the function twice on the first group in order to check whether the function mutates the existing data. – BrenBarn Dec 12 '14 at 09:09
  • Hmmm.... I should try setting as object dtype perhaps. – Woody Pride Dec 12 '14 at 14:09
  • The strangest thing is, I reuse this code all the time with no issues. I know apply and transform pass different packets of data, such that it is quite hard to ascertain from print statements what is going on, but agg is fairly straightforward. Were you able to recreate the error? – Woody Pride Dec 12 '14 at 15:35
  • I can reproduce the error, but I can't reproduce the non-error of it working. Your `test_func` does reduce because it returns only a single value. Do you have a working example where the aggregating function returns a list? Did that ever work for you? – BrenBarn Dec 12 '14 at 18:11
  • Yes, it has worked for over a year since I wrote the damn thing, that's why I'm so perplexed. I'll try to generate some data for which it works. – Woody Pride Dec 13 '14 at 02:42
  • An interesting solution is to return `tuple(L)` instead of `L` – Woody Pride Dec 14 '14 at 16:02

2 Answers


I can't really explain why, but in my experience lists in a pandas.DataFrame don't work all that well.

I usually use a tuple instead. This will work:

def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]

DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

         date line_code           s.m.v.
0  2013-04-02    401101   (7.76, 25.564)
1  2013-04-02    401102           25.564
2  2013-04-02    401103             9.55
3  2013-04-02    401104             4.87
4  2013-04-02    401105   (7.76, 25.564)
5  2013-04-02    401106  (5.282, 25.564)
6  2013-04-02    401107            5.282
paulo.filip3
  • It has to do with the fact the `tuple` type is immutable and therefore hashable and `list` is not. – paulo.filip3 Jul 17 '17 at 13:10
  • probably! But the concepts are the same from "does not aggregate" perspective, so there is no way to guess something won't work b/c you are using lists and not tuples. Nice catch! – Ufos Nov 12 '17 at 21:47

This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:

def _aggregate_series_pure_python(self, obj, func):

    group_index, _, ngroups = self.group_info

    counts = np.zeros(ngroups, dtype=int)
    result = None

    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')

        counts[label] = group.shape[0]
        result[label] = res

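Notice the test: if result is None and isinstance(res, list), a ValueError is raised. In other words, the check only applies to the value returned for the first group. Your options are: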

  1. Fake out groupby().agg(), so it doesn't see a list for the first group, or

  2. Do the aggregation yourself, using code like that above but without the erroneous test.
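
As a rough sketch of option 2, the aggregation can be done with a plain loop over the groups instead of agg(). This is not a copy of the pandas internals quoted above, just one way to get the same shape of result; the names make_list, df, rows and df_agg are my own, and the data is a shortened version of the question's test data.

```python
import pandas as pd

def make_list(x):
    # Return the bare value for a one-row group, otherwise a list of values.
    values = x.tolist()
    return values if len(values) > 1 else values[0]

df = pd.DataFrame({
    'date': ['2013-04-02'] * 4,
    'line_code': ['401101', '401101', '401102', '401103'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550],
})

# Iterate over the groups by hand, so no "does it reduce?" check is involved.
rows = []
for (date, line_code), group in df.groupby(['date', 'line_code']):
    rows.append({
        'date': date,
        'line_code': line_code,
        's.m.v.': make_list(group['s.m.v.']),
    })

# Build the aggregated frame; the 's.m.v.' column ends up with object dtype.
df_agg = pd.DataFrame(rows, columns=['date', 'line_code', 's.m.v.'])
print(df_agg)
```

This trades the speed of agg() for control over what each cell holds, which is probably fine for modest numbers of groups.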

Nik Bates-Haus
  • As explained in the other answer, `tuple` will work just fine, precisely b/c the above function does not check whether the object is a `tuple`. Bug or a feature -- you decide! – Ufos Nov 12 '17 at 21:48
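
To make the first-group behaviour concrete, here is a small pure-Python stand-in for the quoted loop. It is a simplification and not the pandas source: aggregate_sketch, as_list and as_tuple are hypothetical names, groups are plain Python lists, and only the list part of the isinstance test is modelled (the real snippet also rejects Series, Index and np.ndarray).

```python
def aggregate_sketch(groups, func):
    """Simplified stand-in for the quoted loop: only the first result is type-checked."""
    results = []
    first = True
    for group in groups:
        res = func(group)
        if first and isinstance(res, list):
            raise ValueError('Function does not reduce')
        first = False
        results.append(res)
    return results

def as_list(values):
    # List for multi-observation groups, bare value otherwise.
    return list(values) if len(values) > 1 else values[0]

def as_tuple(values):
    # Same idea, but returning a tuple instead of a list.
    return tuple(values) if len(values) > 1 else values[0]

multi_first = [[7.76, 25.564], [25.564], [9.55]]   # first group has two observations
single_first = [[25.564], [7.76, 25.564], [9.55]]  # first group has one observation

# A list returned for the first group trips the check.
try:
    aggregate_sketch(multi_first, as_list)
    list_first_failed = False
except ValueError:
    list_first_failed = True

# A tuple for the first group passes, as does a list for any later group.
tuple_result = aggregate_sketch(multi_first, as_tuple)
list_result = aggregate_sketch(single_first, as_list)
```

If the quoted check is indeed what runs, this would also be consistent with the function having worked on earlier data sets whose first group happened to hold a single observation.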