
I have a DataFrame with a many-levelled MultiIndex.

I know that there are duplicates in the MultiIndex (because I don't care about a distinction that the underlying database does care about).

I want to sum over these duplicates:

>>> import pandas as pd
>>> x = pd.DataFrame({'month':['Sep', 'Sep', 'Oct', 'Oct'], 'day':['Mon', 'Mon', 'Mon', 'Tue'], 'sales':[1,2,3,4]})
>>> x
   day month  sales
0  Mon   Sep      1
1  Mon   Sep      2
2  Mon   Oct      3
3  Tue   Oct      4
>>> x = x.set_index(['day', 'month'])
>>> x
           sales
day month       
Mon Sep        1
    Sep        2
    Oct        3
Tue Oct        4

to give me:

           sales
day month       
Mon Sep        3
    Oct        3
Tue Oct        4

Buried deep in this SO answer to a similar question is the suggestion:

df.groupby(level=df.index.names).sum()

But this seems to me to fail the 'readability counts' criterion of good Python code.

Does anyone know of a more human-readable way?
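
For reference, here is the quoted suggestion applied to the example frame above. This is only a minimal sketch: the variable name `summed` is mine, and `sort=False` is an optional extra I added so the groups stay in order of first appearance (Sep before Oct), as in the desired output.

```python
import pandas as pd

# Rebuild the example frame from the question.
x = pd.DataFrame({
    'month': ['Sep', 'Sep', 'Oct', 'Oct'],
    'day': ['Mon', 'Mon', 'Mon', 'Tue'],
    'sales': [1, 2, 3, 4],
}).set_index(['day', 'month'])

# Group on every index level, so rows that share a full index tuple
# collapse into a single row holding their sum.
# sort=False keeps the groups in order of first appearance.
summed = x.groupby(level=list(x.index.names), sort=False).sum()
print(summed)
```

The result has one row per distinct (day, month) pair, so its index no longer contains duplicates.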

LondonRob
  • try: ``df.sum(level=[levels_that_I_want])`` – Jeff Aug 19 '14 at 16:38
  • I want all the levels, since I'm really looking for duplicates. – LondonRob Aug 19 '14 at 16:43
  • Please show an example then; you normally won't have duplicates in a MultiIndex, so show how it was constructed. – Jeff Aug 19 '14 at 16:47
  • It might make sense to allow something like ``df.sum(level='all')``, which I guess is unambiguous if no levels are named 'all'. what do you think? – Jeff Aug 19 '14 at 17:38
  • Isn't this a more generic use-case than adding something to `sum`? Might I not also want to take the `max` and `min`? In which case, it might be more like `df.groupby(level='all')`. – LondonRob Aug 19 '14 at 17:52
  • of course, nothing to do with sum specifically (and ``df.sum(level=....)`` actually does a groupby), it has to do with handling ``level='all'`` translating to all levels. – Jeff Aug 19 '14 at 17:59
  • By the way, a simple way to avoid this duplicates business is to just have another arbitrary level of the index that deduplicates with `0, 1, 2...` – U2EF1 Aug 20 '14 at 04:33
  • @Jeff : `all` is an okayish name for a column (and hence for a level). How about something along the lines of `level=[]`? Me personally, I'd prefer inconsistency with `sum()` [maybe make `=all` deprecated after all] over breaking code if users use `all` as a column name (especially since that's not a magic word). – FooBar Aug 20 '14 at 11:23
  • Yeah, using an empty list ``[]`` would probably be better. Please open an issue for this suggestion if you'd like (with an example). – Jeff Aug 20 '14 at 11:54
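
Two points from the comments can be sketched on the example data. First, since `df.sum(level=...)` does a groupby anyway (as Jeff notes), the same all-levels grouping also gives `max` and `min`. Second, U2EF1's extra counter level can be built with `cumcount`. The level name `dup` and the variable names are hypothetical choices of mine, not anything pandas prescribes, and this is an illustration rather than a recommended recipe.

```python
import pandas as pd

# Same example frame as in the question.
x = pd.DataFrame({
    'month': ['Sep', 'Sep', 'Oct', 'Oct'],
    'day': ['Mon', 'Mon', 'Mon', 'Tue'],
    'sales': [1, 2, 3, 4],
}).set_index(['day', 'month'])

# One grouping over all index levels, reused for several aggregations.
grouped = x.groupby(level=list(x.index.names), sort=False)
largest = grouped.max()
smallest = grouped.min()

# U2EF1's idea: number the repeats within each index tuple (0, 1, 2...)
# and append that counter as an extra index level ('dup' is a made-up name).
counter = grouped.cumcount().to_numpy()
deduplicated = x.assign(dup=counter).set_index('dup', append=True)
print(deduplicated)
```

With the counter level appended, every row keeps its own index entry, so nothing has to be aggregated until you choose to.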
