How to count occurrences of a distinct value in a column?

Question

I have already sorted the dataframe by "ColumnA' and 'Column B' Column A is has identifiers encoded as string and Column B is pandas timestamp

I have a dataframe that has two columns where the first is the customer id and the second column is the timestamp in which they accessed a certain feature. Both columns have no NA and have been encoded correctly

ColumnA    ColumnnB
Client A   timestamp0_A
Client B   timestamp0_B
Client B   timestamp1_B
Client B   timestamp2_B

I want a counter that would keep incrementing if a certain client used a feature multiple times

ColumnA    ColumnnB         Counter
Client A   timestamp0_A     1
Client B   timestamp0_B     1
Client B   timestamp1_B     2
Client B   timestamp2_B     3

I cant do this in SQL with window functions but need to do this in Pandas because of data transfer issues. Any help of pointers would be appreciated

I tried iterating through the rows and referencing the prior rows but its extremely slow as the lookback windows are arbitrarily long

score 0 · Accepted Answer · answered Aug 16 '23 at 04:41

0

Iterating is slow for a large data set. cumcount is a function to get what you are looking for.

import pandas as pd
data = {
    'ColumnA': ['Client A', 'Client B', 'Client B', 'Client B'],
    'ColumnB': ['timestamp0_A', 'timestamp0_B', 'timestamp1_B', 'timestamp2_B']
}
df = pd.DataFrame(data)
df['Counter'] = df.groupby('ColumnA').cumcount() + 1
print(df)

Ouput

	ColumnA	ColumnB	Counter
0	Client A	timestamp0_A	1
1	Client B	timestamp0_B	1
2	Client B	timestamp1_B	2
3	Client B	timestamp2_B	3

answered Aug 16 '23 at 04:41

taller_ExcelHome

2,232
1
2
12

Hi taller_ExcelHome, thanks, as everything is sorted by client ids and then time stamps, I can assume that the cumumlative count will respect the temporal sorting? – longgamma Aug 16 '23 at 04:51
Yes, pls refers to pandas doc for detail. https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.core.groupby.GroupBy.cumcount.html – taller_ExcelHome Aug 16 '23 at 04:58

score 0 · Answer 2 · answered Aug 16 '23 at 05:00

You can do the grouping on ColumnA and then do the ranking on the ColumnB

In [171]: df["Counter"] = df.groupby(["ColumnA"])["ColumnB"].rank(method="first")

In [172]: df
Out[172]:
    ColumnA       ColumnB  Counter
0  Client A  timestamp0_A      1.0
1  Client B  timestamp0_B      1.0
2  Client B  timestamp1_B      2.0
3  Client B  timestamp2_B      3.0

How to count occurrences of a distinct value in a column?

2 Answers2