Create multiple columns with calculations using groupby, aggregate functions in Pandas

Question

import pandas as pd
df = pd.DataFrame({'zip,company': ["46062|A","11236|B","11236|C","11236|C","11236|C","11236|A","11236|A","11236|A","11236|B","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A"], 
                   'goodbadscore': ["good","bad","bad","good","good","bad","bad","good","good","good","bad","good","good","good","good","bad","bad","good"],
                   'postlcode' : ["46062","11236","11236","11236","11236","46062","11236","46062","11236","11236","11236","11236","11236","11236","11236","11236","11236","11236"],
                   'companyname': ["A","B","C","C","C","A","A","A","B","B","A","A","B","A","A","B","A","A"]}
                   )
                   
print(df)

-----updated a sample data frame above as suggestion-----

I tried to produce the result in Excel, but using countif and countifs break my desktop and even it's fine, it takes several minutes to complete the task. hope can get some help and directions.

here is what i try to achieve:

I want to score company's' reputation in several zip codes based on the collected data. columns needed to produce:

countinzipcode
countgoodscoreinzip
dividegoodscore%(2/1)
ranking

I was able to produce 1:

op = df.groupby(['zip+company'])['zip+company'].count()
have difficulty on 2: want to keep the output from 1, but it becomes 0 after apply. only want to show good for column 2

op = op.groupby(['zip+company'])[['zip+company','countgoodscoreunderzip']].apply(lambda x: x[x=='good'].count())
then 3, I guess it's a matter of selecting 2 and divided by 1
4 no idea yet how to rank in pandas, which could be a simple ranking

The pic of excel is the ideal output(updated with a sample data frame).

Thanks for the reading.

Hey, Oliver. Folks will be able to better help if you post your data in a good pandas reproducible format (rather than screenshot). This post explains how to do so: [How to make good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — scotscotmcc, Jul 09 '21 at 17:02
@scotscotmcc thanks for the heads up and link. doing it now. — Oliver Bird, Jul 09 '21 at 18:13

score 1 · Accepted Answer · answered Jul 09 '21 at 19:18

Named aggregation should help the first two columns:

op = df.groupby('zip,company', as_index=False).aggregate(
    countinzipcode=('zip,company', 'count'),
    goodscoreinzip=('goodbadscore', lambda s: s.eq('good').sum())
)

op:

  zip,company  countinzipcode  goodscoreinzip
0     11236|A               7               4
1     11236|B               5               3
2     11236|C               3               2
3     46062|A               3               2

Simple math operations can be used to get the percentage for 3:

op['goodscore%'] = op['goodscoreinzip'] / op['countinzipcode'] * 100

  zip,company  countinzipcode  goodscoreinzip  goodscore%
0     11236|A               7               4   57.142857
1     11236|B               5               3   60.000000
2     11236|C               3               2   66.666667
3     46062|A               3               2   66.666667

Then rank can be used to get the ranking for 4:

op['ranking'] = op['goodscore%'].rank(ascending=False, method='dense').astype(int)

op:

  zip,company  countinzipcode  goodscoreinzip  goodscore%  ranking
0     11236|A               7               4   57.142857        3
1     11236|B               5               3   60.000000        2
2     11236|C               3               2   66.666667        1
3     46062|A               3               2   66.666667        1

Sample Data Used (Based on the numbers in the image not the code constructor):

df = pd.DataFrame({
    'zip,company': ["46062|A", "11236|B", "11236|C", "11236|C",
                    "11236|C", "11236|A", "11236|A", "11236|A",
                    "11236|B", "11236|B", "11236|A", "11236|A",
                    "11236|B", "11236|A", "11236|A", "11236|B",
                    "46062|A", "46062|A"],
    'goodbadscore': ["good", "bad", "bad", "good", "good", "bad",
                     "bad", "good", "good", "good", "bad",
                     "good", "good", "good", "good", "bad",
                     "bad", "good"],
    'postlcode': ["46062", "11236", "11236", "11236", "11236",
                  "46062", "11236", "46062", "11236", "11236",
                  "11236", "11236", "11236", "11236", "11236",
                  "11236", "11236", "11236"],
    'companyname': ["A", "B", "C", "C", "C", "A", "A", "A", "B",
                    "B", "A", "A", "B", "A", "A", "B", "A", "A"]
})

   zip,company goodbadscore postlcode companyname
0      46062|A         good     46062           A
1      11236|B          bad     11236           B
2      11236|C          bad     11236           C
3      11236|C         good     11236           C
4      11236|C         good     11236           C
5      11236|A          bad     46062           A
6      11236|A          bad     11236           A
7      11236|A         good     46062           A
8      11236|B         good     11236           B
9      11236|B         good     11236           B
10     11236|A          bad     11236           A
11     11236|A         good     11236           A
12     11236|B         good     11236           B
13     11236|A         good     11236           A
14     11236|A         good     11236           A
15     11236|B          bad     11236           B
16     46062|A          bad     11236           A
17     46062|A         good     11236           A

I changed to True because I got the ValueError(ValueError: cannot insert zip,company, already exists). why it's working on yours not mine? — Oliver Bird, Jul 12 '21 at 15:34

Create multiple columns with calculations using groupby, aggregate functions in Pandas

1 Answers1