How do you sum up rows in Pandas based on conditions for multuples columns and remove the duplicates?

Question

First let me apologise for the long winded question. I've struggled to find an answer on Stackoverflow that addresses my specific issue. I am new to Pandas and Python programming so I would appreciate all the help I can get.

I have a dataframe:

 ID            Name  Colour      Power  Year  Money (millions)
0   1234567      Tony Stark     Red     Genius  2020             20000
1   9876543    Peter Parker     Red     Spider  2021                75
2   1415926   Miles Morales   Green     Spider  2021                55
3   7777777    Dante Brisco    Blue     hybrid  2020                 3
4   4355681    Thor Odinson    Blue  Lightning  2020               655
5   1928374     Bruce Wayne  Yellow        Bat  2021             12000
6   5555555     Eddie Brock   Black   Symbiote  2021               755
7   8183822  Billie Butcher  Yellow          V  2021                34
8   6666654        Ian Wilm     Red  Lightning  2020                34
9   4241111    Harry Potter   Green     Wizard  2020                24
10  7765434        Blu Malk     Red     Wizard  2021                77
11  6464647         Yu Hant   Black     Wizard  2021                65

I want to create a new df that looks like this:

 **Colour    Total      Year 2020     Year 2021**
 Red        20186     20034          152
 Green      79        24             55 
 Blue       658       658            -------
 Yellow     12034     -------        12034
 Black      820       -------        820

Where the "Colour" column becomes the new primary key/ID, the duplicates are removed and the values per year are summed up along with an overall total. I have managed to sum up the Total but I am struggling to write a function that will sum up rows by year and than assign the sum to the respective colour. I would eventually like to Create new columns based on calculations from the Yearly columns (percentages)

Here is what I have after creating the DF from an excel file :

#This line helps me calculate the total from the old df. 
df['Total'] = df.groupby(['Colour'])['Money (millions)'].transform('sum') 

#This line drops the duplicates from the line above. So now I have a total column that matches the #Colours
new_df = df.drop_duplicates(subset=['Colour'])

When I repeat the process for the Yearly column using the same technique it sums up the overall total for the whole year and assigns it to every colour.

I would eventually like to Create new columns based on calculations from the Yearly columns (percentages) e.g.

 new_df['Success Rate'] = new_df['Total'].apply(lambda x: (x/100)*33)

I'd be grateful for any help provided :)

Please provide a [mcve], as well as the current output. See [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391). — AMC, Oct 29 '20 at 00:08

score 2 · Accepted Answer · answered Oct 28 '20 at 23:38

2

You can use:

df = pd.pivot_table(df, index='Colour', values='Money (millions)', columns='Year', aggfunc='sum', margins=True)
df
Out[1]: 
Year       2020     2021    All
Colour                         
Black       NaN    820.0    820
Blue      658.0      NaN    658
Green      24.0     55.0     79
Red     20034.0    152.0  20186
Yellow      NaN  12034.0  12034
All     20716.0  13061.0  33777

answered Oct 28 '20 at 23:38

David Erickson

16,433
2
19
35

Thank you so much for the response. When I use your solution I get the desired output, but when I try applying a lambda function to each of the pivot table columns I get a key error. I figured the pivot table columns, 2020 and 2021, would be pandas columns. But when I try accessing them to apply the lambda I get an error saying the column name doesn't exist. – A.Bux Nov 04 '20 at 11:08
@A.Bux . The columns are multi-level. You can use ‘df.columns’ to figure out the column names. – David Erickson Nov 05 '20 at 00:00

score 2 · Answer 2 · answered Oct 28 '20 at 23:39

2

I think this is pivot_table with margins:

df.pivot_table(index='Colour', columns='Year', 
               values='Money (millions)',
               aggfunc='sum',
               margins_name='Total',
               margins=True)

OUtput:

Year       2020     2021  Total
Colour                         
Black       NaN    820.0    820
Blue      658.0      NaN    658
Green      24.0     55.0     79
Red     20034.0    152.0  20186
Yellow      NaN  12034.0  12034
Total   20716.0  13061.0  33777

answered Oct 28 '20 at 23:39

Quang Hoang

146,074
10
56
74

Thank you so much for the response. When I use your solution I get the desired output, but when I try applying a lambda function to each of the pivot table columns I get a key error. I figured the pivot table columns, 2020 and 2021, would be pandas columns. But when I try accessing them to apply the lambda I get an error saying the column name doesn't exist. – A.Bux Nov 04 '20 at 11:08

How do you sum up rows in Pandas based on conditions for multuples columns and remove the duplicates?

2 Answers2