Sum duplicate rows in two columns in Pandas dataframe by index

Question

I have the following dataframe, read into pandas from a csv.

           Fruit    Apple   Pear
Date                  
2016-03-30  Pear            1
2016-04-14  Pear            1
2016-04-14  Pear            1
2016-05-09  Apple   1
2016-05-18  Apple   1   
2016-06-24  Pear            1
2016-06-27  Apple   1   
2016-06-27  Pear            1
2016-06-28  Apple   1   
2016-06-28  Apple   1   
2016-07-05  Pear            1

I'd like it to look like this. It sums the duplicate rows and then drops the duplicate row. The value in the Fruit column doesn't matter at this point. I'll drop that column later.

           Fruit    Apple   Pear
Date                  
2016-03-30  Pear            1
2016-04-14  Pear            2
2016-05-09  Apple   1
2016-05-18  Apple   1   
2016-06-24  Pear            1
2016-06-27  Apple   1       1
2016-06-28  Apple   2   
2016-06-28  Apple   1   
2016-07-05  Pear            1

I tried using groupby, but it filled the Apple/Pear columns with 1s. I tried a for loop that if the next date index was the same as the current date index, it would += 1 to the respective Apple/Pear column. That failed because the type of the column, and my attempts to change the column type failed as well. Banging my head against this one and I can't get it.

The one thing I can do right is drop the duplicates with df = df.reset_index().drop_duplicates('Date',keep='last').set_index('Date')

Reading [this](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) on how to ask a good pandas question might be helpful. Your posted frames are somewhat hard to parse, and it's not clear whether the value at 2016-03-30/Apple is a 0 or a NaN and you've suppressed it for some reason or an empty string, etc. — DSM, Mar 17 '17 at 15:25

score 2 · Accepted Answer · edited May 23 '17 at 12:32

2

IIUC you need groupby by index - level=0 and aggregate sum:

df = df.groupby(level=0).sum()
print (df)
            Apple  Pear
Date                   
2016-03-30    NaN   1.0
2016-04-14    NaN   2.0
2016-05-09    1.0   NaN
2016-05-18    1.0   NaN
2016-06-24    NaN   1.0
2016-06-27    1.0   1.0
2016-06-28    2.0   NaN
2016-07-05    NaN   1.0

If need cumsum as mentioned Noobie, thank you:

df = df.groupby(level=0).cumsum() 
print (df)
            Apple  Pear
Date                   
2016-03-30    NaN   1.0
2016-04-14    NaN   1.0
2016-04-14    NaN   2.0
2016-05-09    1.0   NaN
2016-05-18    1.0   NaN
2016-06-24    NaN   1.0
2016-06-27    1.0   NaN
2016-06-27    NaN   1.0
2016-06-28    1.0   NaN
2016-06-28    2.0   NaN
2016-07-05    NaN   1.0

edited May 23 '17 at 12:32

Community

1
1

answered Mar 17 '17 at 15:14

jezrael

822,522
95
1,334
1,252

`df = df.groupby(level=0).sum().min().max().cumsum()` works as well – ℕʘʘḆḽḘ Mar 17 '17 at 15:17
@Noobie - Thank you for suggestion, maybe you think something like `df.groupby(level=0).cumsum().reset_index().drop_duplicates('Date', keep='last')` ? – jezrael Mar 17 '17 at 15:19
yes thats more efficient – ℕʘʘḆḽḘ Mar 17 '17 at 15:22
Thanks for the suggestions. Noobie's results in `AttributeError: 'str' object has no attribute 'cumsum'` – slard Mar 17 '17 at 15:34
@jezrael's gives this `pandas.core.base.DataError: No numeric types to aggregate`. The df dtype is object. Is that the problem with trying to sum or cumsum them? – slard Mar 17 '17 at 15:36
Yes, need cast to numeric by `df[['Apple', 'Pear']] = df[['Apple', 'Pear']].astype(float)` and if it does not work need to_numeric like `df['Apple'] = pd.to_numeric(df['Apple'], errors='coerce')` and then `df['Pear'] = pd.to_numeric(df['Pear'], errors='coerce')` for replace non numeric to Nan – jezrael Mar 17 '17 at 16:18
`df['Apple'].apply(pd.to_numeric)` and to ['Pear'] seems to have fixed those errors. Chaining @jezrael's drop_duplicates gives a key error, but I can use the other drop dupe code. .sum works better than .cumsum for this. You can see on 2016-06-27, .cumsum left the dates split. Thanks to you both. – slard Mar 17 '17 at 16:19

score 2 · Answer 2 · answered Nov 30 '18 at 23:39

2

You can simply try this

df.groupby(['Date','Fruit']).agg('sum')

I just realized it's a duplicate question, see this.

answered Nov 30 '18 at 23:39

Abu Shoeb

4,747
2
40
45

Sum duplicate rows in two columns in Pandas dataframe by index

2 Answers2