0

I have already sorted the dataframe by "ColumnA' and 'Column B' Column A is has identifiers encoded as string and Column B is pandas timestamp

I have a dataframe that has two columns where the first is the customer id and the second column is the timestamp in which they accessed a certain feature. Both columns have no NA and have been encoded correctly

ColumnA    ColumnnB
Client A   timestamp0_A
Client B   timestamp0_B
Client B   timestamp1_B
Client B   timestamp2_B

I want a counter that would keep incrementing if a certain client used a feature multiple times

ColumnA    ColumnnB         Counter
Client A   timestamp0_A     1
Client B   timestamp0_B     1
Client B   timestamp1_B     2
Client B   timestamp2_B     3

I cant do this in SQL with window functions but need to do this in Pandas because of data transfer issues. Any help of pointers would be appreciated

I tried iterating through the rows and referencing the prior rows but its extremely slow as the lookback windows are arbitrarily long

longgamma
  • 3
  • 2

2 Answers2

0

Iterating is slow for a large data set. cumcount is a function to get what you are looking for.

import pandas as pd
data = {
    'ColumnA': ['Client A', 'Client B', 'Client B', 'Client B'],
    'ColumnB': ['timestamp0_A', 'timestamp0_B', 'timestamp1_B', 'timestamp2_B']
}
df = pd.DataFrame(data)
df['Counter'] = df.groupby('ColumnA').cumcount() + 1
print(df)

Ouput

ColumnA ColumnB Counter
0 Client A timestamp0_A 1
1 Client B timestamp0_B 1
2 Client B timestamp1_B 2
3 Client B timestamp2_B 3
taller_ExcelHome
  • 2,232
  • 1
  • 2
  • 12
  • Hi taller_ExcelHome, thanks, as everything is sorted by client ids and then time stamps, I can assume that the cumumlative count will respect the temporal sorting? – longgamma Aug 16 '23 at 04:51
  • Yes, pls refers to pandas doc for detail. https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.core.groupby.GroupBy.cumcount.html – taller_ExcelHome Aug 16 '23 at 04:58
0

You can do the grouping on ColumnA and then do the ranking on the ColumnB

In [171]: df["Counter"] = df.groupby(["ColumnA"])["ColumnB"].rank(method="first")

In [172]: df
Out[172]:
    ColumnA       ColumnB  Counter
0  Client A  timestamp0_A      1.0
1  Client B  timestamp0_B      1.0
2  Client B  timestamp1_B      2.0
3  Client B  timestamp2_B      3.0
Hackaholic
  • 19,069
  • 5
  • 54
  • 72