2

I have a pandas df like this

student_id A B
1 3 13
2 4 23
1 5 12
4 28 32
1 38 12
2 21 14

My desired output: I want to drop the duplicates, and count how many duplicates there are according to student_id and keeping the last record/row and append the count column as new column, also average the duplicated rows entry in A and B as new columns

student_id A B count average A rounded average B rounded
1 38 12 3 15 12
2 21 14 2 13 19
4 28 32 1 28 32
  • Does this answer your question? [Multiple aggregations of the same column using pandas GroupBy.agg()](https://stackoverflow.com/questions/12589481/multiple-aggregations-of-the-same-column-using-pandas-groupby-agg) – G. Anderson Apr 13 '21 at 20:34
  • Let us name your two tables: `INPUT_TABLE` and `OUTPUT_TABLE`. For `INPUT_TABLE`, for `student_id == 1`, there are three different `A` values: `{3, 5, 38}` In the output table, what is Column A supposed to contain? It sounds like for `student_id == 1` we find the bottom-most row of `INPUT_TABLE` for which `student_id == 1`, and the `A` value in that row is the `A` in `OUTPUT_TABLE`. Is that correct? – Toothpick Anemone Apr 13 '21 at 20:34
  • yes for A and B we keep the last column – queen of spades Apr 13 '21 at 20:38

2 Answers2

2

You can use named aggregation:

df.groupby('student_id', as_index=False).agg(
    A=('A', 'last'),
    B=('B', 'last'),
    count=('student_id', 'count'),
    average_A_rounded=('A', lambda x: np.mean(x).round()),
    average_B_rounded=('B', lambda x: np.mean(x).round()),
)

#    student_id   A   B  count  average_A_rounded  average_B_rounded
# 0           1  38  12      3                 15                 12
# 1           2  21  14      2                 12                 18
# 2           4  28  32      1                 28                 32
tdy
  • 36,675
  • 19
  • 86
  • 83
  • 1
    could you explain the count = ("A", "count") so after group by, we get distinct student_id and the count is refering to number of A? – queen of spades Apr 14 '21 at 04:50
  • @queenofspades Actually I did that originally because I thought `agg()` would not be able to access the `student_id` after `groupby('student_id')`. However it does seem to be able to access `student_id`, so I've edited the answer to count `student_id` instead. Technically we can count any of the columns after grouping and get the same result, but I agree that using `student_id` is more intuitive. – tdy Apr 14 '21 at 16:33
1

I see that you want round the values "half-up". So to extend the @tdy answer:

def round_half_up(x):
    mask = x >= 0
    out = np.empty_like(x)
    out[mask] = np.floor(x[mask] + 0.5)
    out[~mask] = np.ceil(x[~mask] - 0.5)
    return out


df = df.groupby("student_id", as_index=False).agg(
    A=("A", "last"),
    B=("B", "last"),
    count=("A", "count"),
    average_A_rounded=("A", "mean"),
    average_B_rounded=("B", "mean"),
)

print(df.apply(round_half_up).astype(int))

Prints:

   student_id   A   B  count  average_A_rounded  average_B_rounded
0           1  38  12      3                 15                 12
1           2  21  14      2                 13                 19
2           4  28  32      1                 28                 32
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91