The above table is the entry data. I am trying to get the Total sum of points got by each student, the maximum point got by each student, and the name of the subject.
The below table is the result. What is the most efficient way to use groupby.
The above table is the entry data. I am trying to get the Total sum of points got by each student, the maximum point got by each student, and the name of the subject.
The below table is the result. What is the most efficient way to use groupby.
Use groupby()
method:
resdf=df.groupby('Name').agg(Sum=('Point','sum'),MaxSub=('Sub','last'),Point=('Point','max'))
Output of resdf
:
Name Sum MaxSub Point
0 A 210 Socio 90
1 B 115 Com 70
2 C 150 Eng 90
Let's try to groupby transform to get the sum for each group, sort_values to ensure that the max points is at the end of each group, then drop_duplicates with keep last to keep only max row:
import pandas as pd
df = pd.DataFrame({'SN': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
'Name': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C',
6: 'C'},
'Sub': {0: 'Math', 1: 'Eng', 2: 'Socio', 3: 'Geo', 4: 'Com',
5: 'Com', 6: 'Eng'},
'Point': {0: 70, 1: 50, 2: 90, 3: 45, 4: 70, 5: 60, 6: 90}})
# Get Sum For each Group
df['Sum'] = df.groupby('Name')['Point'].transform('sum')
# Sort Values so Highest Point value is the last in each group
df = df.sort_values(['Name', 'Point'])
# Keep Only the last from each Group
df = df.drop_duplicates('Name', keep='last').reset_index(drop=True)
# Re-order and rename Columns
df = df[['Name', 'Sum', 'Sub', 'Point']].rename(columns={'Sub': 'Max Sub'})
print(df)
df
:
Name Sum Max Sub Point
0 A 210 Socio 90
1 B 115 Com 70
2 C 150 Eng 90
Since "most efficient way" was noted in the question, here's a perfplot:
Depending on whether the DataFrame has more or less than 10,000 rows will determine which option is more performant.
import string
import numpy as np
import pandas as pd
import perfplot
def gen_data(n):
return pd.DataFrame({
'Name': np.random.choice(list(string.ascii_uppercase)[:max(3, n // 2)],
size=n),
'Sub': np.random.choice(['Math', 'Eng', 'Socio', 'Com'], size=n),
'Point': np.random.randint(50, 90, size=n)
}).sort_values('Name') \
.reset_index(drop=True) \
.reset_index() \
.rename(columns={'index': 'SN'}) \
.assign(SN=lambda s: s.SN + 1)
def anurag_dabas(df):
return df.groupby('Name').agg(Sum=('Point', 'sum'),
MaxSub=('Sub', 'last'),
Point=('Point', 'max'))
def henry_ecker(df):
df['Sum'] = df.groupby('Name')['Point'].transform('sum')
return df.sort_values(['Name', 'Point']) \
.drop_duplicates('Name', keep='last') \
.reset_index(drop=True)[['Name', 'Sum', 'Sub', 'Point']] \
.rename(columns={'Sub': 'Max Sub'})
if __name__ == '__main__':
out = perfplot.bench(
setup=gen_data,
kernels=[
anurag_dabas,
henry_ecker
],
labels=['Anurag Dabas', 'Henry Ecker'],
n_range=[2 ** k for k in range(25)],
equality_check=None
)
out.save('perfplot_results.png', transparent=False)