8

I have a dataframe which have three level of index, and I wish to calculate how much a value deviates from the mean. But I have different mean for different groups based on my indices. This is what I tried:

In [4]: df['count'].groupby(level=[0,1,2]).apply(lambda x: x-np.mean(x))

However, I get an error, the stack trace of which I have inserted below. I am not sure why the problem is so.


Exception                                 Traceback (most recent call last)
<ipython-input-4-678992689ff2> in <module>()
----> 1 df['count'].groupby(level=[0,1,2]).apply(lambda x: x-np.mean(x))

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in apply(self, func, *args, **kwargs)
    713         # ignore SettingWithCopy here in case the user mutates
    714         with option_context('mode.chained_assignment',None):
--> 715             return self._python_apply_general(f)
    716
    717     def _python_apply_general(self, f):

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in _python_apply_general(self, f)
    720
    721         return self._wrap_applied_output(keys, values,
--> 722                                          not_indexed_same=mutated)
    723
    724     def aggregate(self, func, *args, **kwargs):

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in _wrap_applied_output(self, keys, values, not_indexed_same)
   2520         if isinstance(values[0], (Series, dict)):
   2521             return self._concat_objects(keys, values,
-> 2522                                         not_indexed_same=not_indexed_same)
   2523         elif isinstance(values[0], DataFrame):
   2524             # possible that Series -> DataFrame by applied function

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.pyc in _concat_objects(self, keys, values, not_indexed_same)
   1258
   1259             if isinstance(result, Series):
-> 1260                 result = result.reindex(ax)
   1261             else:
   1262                 result = result.reindex_axis(ax, axis=self.axis)

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.pyc in reindex(self, index, **kwargs)
   2266     @Appender(generic._shared_docs['reindex'] % _shared_doc_kwargs)
   2267     def reindex(self, index=None, **kwargs):
-> 2268         return super(Series, self).reindex(index=index, **kwargs)
   2269
   2270     @Appender(generic._shared_docs['fillna'] % _shared_doc_kwargs)

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs)
   1960         # perform the reindex on the axes
   1961         return self._reindex_axes(axes, level, limit, tolerance,
-> 1962                                   method, fill_value, copy).__finalize__(self)
   1963
   1964     def _reindex_axes(self, axes, level, limit, tolerance, method,

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\generic.pyc in _reindex_axes(self, axes, level, limit, tolerance, method, fil
l_value, copy)
   1974             new_index, indexer = ax.reindex(
   1975                 labels, level=level, limit=limit, tolerance=tolerance,
-> 1976                 method=method)
   1977
   1978             axis = self._get_axis_number(a)

C:\Users\bchandra\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit, tolerance)
   5280                 else:
   5281                     raise Exception(
-> 5282                         "cannot handle a non-unique multi-index!")
   5283
   5284         if not isinstance(target, MultiIndex):

Exception: cannot handle a non-unique multi-index!

My data frame looks something like this:

                 Count
      Cat SubCat 
        1   2      7
        1   2      5
        1   3      4
        1   3      3
        4   5      2
        4   5      1
        4   7      0
        4   7     -1

For simplicity sake, say my index is 2 levels instead of 3. What I want to do is group by (cat,Sub) which means (category,subcategory).

Then find the mean of all group which will be 7+5/2=6 here in the first case where I have grouped by cat=1, sub=2. Then I would like to find 7-6 and 5-6 respectively.

so something like df.groupby(level=[0,1]).apply(lambda x: x-np.mean(x))

Some dummy code that shows error on my pc (Pandas version 0.17.1) :

index=[]
[index.append(x) for y in range(25) for x in np.arange(2)]
subindex=[]
[subindex.append(10*x) for y in range(25) for x in np.arange(2)]
sample=pd.DataFrame({'count':np.arange(2*25),'cat':index,'sub':subindex,'date':np.random.randint(2*25)})
sample.set_index(['cat','sub'],inplace=True)
sample['count'].groupby(level=[0,1]).apply(lambda x: x-np.mean(x))
smci
  • 32,567
  • 20
  • 113
  • 146
CoderBC
  • 1,262
  • 2
  • 13
  • 30
  • Can you post data and code to reproduce this error also does this produce the same error `df['count'].groupby(level=[0,1,2]).head()`? – EdChum Mar 17 '16 at 13:57
  • 2
    you have duplicates in your index, and your function doesn't work with duplicates. remove the duplicates (either by deleting them or adding a new field to the index that distinguishes them) and you'll be fine. http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries – Paul H Mar 17 '16 at 13:58
  • @EdChum, it does not show an error on running that command. I have posted some dummy data if you'd like to see. – CoderBC Mar 17 '16 at 14:12
  • @PaulH Nope, that does not solve my problem because I need the duplicates to be there. – CoderBC Mar 17 '16 at 14:12
  • 2
    On your sample input data it works for me: `In [64]: df.groupby(level=[0,1]).apply(lambda x: x-np.mean(x)) df.groupby(level=[0,1]).apply(lambda x: x-np.mean(x)) Out[64]: Count Cat SubCat 1 2 1.0 2 -1.0 3 0.5 3 -0.5 4 5 0.5 5 -0.5 7 0.5 7 -0.5` – EdChum Mar 17 '16 at 14:25
  • 1
    it works for me too on pandas 0.15.1 – jrjc Mar 17 '16 at 14:28
  • @EdChum There is a subtle difference in my first command and the dummy one. I have more than one column in my data set, i have 'count' and 'date' and 2 more. I think that might be making a difference here. I will try to debug further. Thanks! – CoderBC Mar 17 '16 at 14:34
  • could you just reset the index and then group on plain old columns? – Paul H Mar 17 '16 at 14:59
  • @PaulH yes that is what I did for now Still I would like to find out what is the problem with the first approach. – CoderBC Mar 17 '16 at 15:10
  • Post a reproducible example – Paul H Mar 17 '16 at 15:11
  • @PaulH I have updated the question – CoderBC Mar 17 '16 at 15:34

2 Answers2

2

Having a multiindex dataframe, it's possible that you didn't notice there were duplicated column names at level 0, 1 or 2. This is what happened to me. If you want to aggregate data, count the columns with a code like:

df.columns.value_counts()

then drop the duplicated columns.

0

I had the same error once while aggregating the data. Having many columns, I didn't notice that some of them were duplicated columns. Before grouping by, check all your column names or use a function to drop duplicates in your hierarchical index.