0

enter image description here

The above table is the entry data. I am trying to get the Total sum of points got by each student, the maximum point got by each student, and the name of the subject.

The below table is the result. What is the most efficient way to use groupby.

enter image description here

skk
  • 25
  • 4
  • post your data as text instead of images so that we can reproduce your dataframe – Anurag Dabas May 16 '21 at 01:43
  • See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example), [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888), and [Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/q/285551/15497888) – Henry Ecker May 16 '21 at 01:44

2 Answers2

3

Use groupby() method:

resdf=df.groupby('Name').agg(Sum=('Point','sum'),MaxSub=('Sub','last'),Point=('Point','max'))

Output of resdf:

    Name    Sum     MaxSub  Point
0   A       210     Socio   90
1   B       115     Com     70
2   C       150     Eng     90
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
2

Let's try to groupby transform to get the sum for each group, sort_values to ensure that the max points is at the end of each group, then drop_duplicates with keep last to keep only max row:

import pandas as pd

df = pd.DataFrame({'SN': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
                   'Name': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C',
                            6: 'C'},
                   'Sub': {0: 'Math', 1: 'Eng', 2: 'Socio', 3: 'Geo', 4: 'Com',
                           5: 'Com', 6: 'Eng'},
                   'Point': {0: 70, 1: 50, 2: 90, 3: 45, 4: 70, 5: 60, 6: 90}})

# Get Sum For each Group
df['Sum'] = df.groupby('Name')['Point'].transform('sum')

# Sort Values so Highest Point value is the last in each group
df = df.sort_values(['Name', 'Point'])

# Keep Only the last from each Group
df = df.drop_duplicates('Name', keep='last').reset_index(drop=True)

# Re-order and rename Columns
df = df[['Name', 'Sum', 'Sub', 'Point']].rename(columns={'Sub': 'Max Sub'})

print(df)

df:

  Name  Sum Max Sub  Point
0    A  210   Socio     90
1    B  115     Com     70
2    C  150     Eng     90

Since "most efficient way" was noted in the question, here's a perfplot:

enter image description here

Depending on whether the DataFrame has more or less than 10,000 rows will determine which option is more performant.

import string

import numpy as np
import pandas as pd
import perfplot


def gen_data(n):
    return pd.DataFrame({
        'Name': np.random.choice(list(string.ascii_uppercase)[:max(3, n // 2)],
                                 size=n),
        'Sub': np.random.choice(['Math', 'Eng', 'Socio', 'Com'], size=n),
        'Point': np.random.randint(50, 90, size=n)
    }).sort_values('Name') \
        .reset_index(drop=True) \
        .reset_index() \
        .rename(columns={'index': 'SN'}) \
        .assign(SN=lambda s: s.SN + 1)


def anurag_dabas(df):
    return df.groupby('Name').agg(Sum=('Point', 'sum'),
                                  MaxSub=('Sub', 'last'),
                                  Point=('Point', 'max'))


def henry_ecker(df):
    df['Sum'] = df.groupby('Name')['Point'].transform('sum')
    return df.sort_values(['Name', 'Point']) \
        .drop_duplicates('Name', keep='last') \
        .reset_index(drop=True)[['Name', 'Sum', 'Sub', 'Point']] \
        .rename(columns={'Sub': 'Max Sub'})


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            anurag_dabas,
            henry_ecker
        ],
        labels=['Anurag Dabas', 'Henry Ecker'],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57