dropping duplicates on one specific column and add a new column as count of repeat records pandas

Question

I have a pandas df like this

student_id	A	B
1	3	13
2	4	23
1	5	12
4	28	32
1	38	12
2	21	14

My desired output: I want to drop the duplicates, and count how many duplicates there are according to student_id and keeping the last record/row and append the count column as new column, also average the duplicated rows entry in A and B as new columns

student_id	A	B	count	average A rounded	average B rounded
1	38	12	3	15	12
2	21	14	2	13	19
4	28	32	1	28	32

Does this answer your question? [Multiple aggregations of the same column using pandas GroupBy.agg()](https://stackoverflow.com/questions/12589481/multiple-aggregations-of-the-same-column-using-pandas-groupby-agg) — G. Anderson, Apr 13 '21 at 20:34
Let us name your two tables: `INPUT_TABLE` and `OUTPUT_TABLE`. For `INPUT_TABLE`, for `student_id == 1`, there are three different `A` values: `{3, 5, 38}` In the output table, what is Column A supposed to contain? It sounds like for `student_id == 1` we find the bottom-most row of `INPUT_TABLE` for which `student_id == 1`, and the `A` value in that row is the `A` in `OUTPUT_TABLE`. Is that correct? — Toothpick Anemone, Apr 13 '21 at 20:34

tdy · Accepted Answer · 2021-04-14T16:31:35.113

2

You can use named aggregation:

df.groupby('student_id', as_index=False).agg(
    A=('A', 'last'),
    B=('B', 'last'),
    count=('student_id', 'count'),
    average_A_rounded=('A', lambda x: np.mean(x).round()),
    average_B_rounded=('B', lambda x: np.mean(x).round()),
)

#    student_id   A   B  count  average_A_rounded  average_B_rounded
# 0           1  38  12      3                 15                 12
# 1           2  21  14      2                 12                 18
# 2           4  28  32      1                 28                 32

edited Apr 14 '21 at 16:31

answered Apr 13 '21 at 20:41

tdy

36,675
19
86
83

1

could you explain the count = ("A", "count") so after group by, we get distinct student_id and the count is refering to number of A? – queen of spades Apr 14 '21 at 04:50
@queenofspades Actually I did that originally because I thought `agg()` would not be able to access the `student_id` after `groupby('student_id')`. However it does seem to be able to access `student_id`, so I've edited the answer to count `student_id` instead. Technically we can count any of the columns after grouping and get the same result, but I agree that using `student_id` is more intuitive. – tdy Apr 14 '21 at 16:33

score 1 · Answer 2 · answered Apr 13 '21 at 20:50

I see that you want round the values "half-up". So to extend the @tdy answer:

def round_half_up(x):
    mask = x >= 0
    out = np.empty_like(x)
    out[mask] = np.floor(x[mask] + 0.5)
    out[~mask] = np.ceil(x[~mask] - 0.5)
    return out


df = df.groupby("student_id", as_index=False).agg(
    A=("A", "last"),
    B=("B", "last"),
    count=("A", "count"),
    average_A_rounded=("A", "mean"),
    average_B_rounded=("B", "mean"),
)

print(df.apply(round_half_up).astype(int))

Prints:

   student_id   A   B  count  average_A_rounded  average_B_rounded
0           1  38  12      3                 15                 12
1           2  21  14      2                 13                 19
2           4  28  32      1                 28                 32

ah good catch with the rounding, thanks for implementing +1 – tdy Apr 14 '21 at 16:39 — tdy, Apr 14 '21 at 16:39

dropping duplicates on one specific column and add a new column as count of repeat records pandas

2 Answers2