I have a DataFrame that looks like this:
CUSTOMER_ID MONTH ACTIVE
123456 2020-01 1
123456 2020-02 0
123456 2020-03 0
123456 2020-04 1
654321 2020-01 1
654321 2020-02 1
654321 2020-03 0
654321 2020-04 0
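For reproducibility, the sample above can be built as follows. This is only a minimal sketch: it assumes MONTH is stored as a 'YYYY-MM' string and ACTIVE as an integer flag, which may not match the real data types.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: MONTH is a 'YYYY-MM' string and ACTIVE is an integer flag.
df = pd.DataFrame({
    "CUSTOMER_ID": [123456] * 4 + [654321] * 4,
    "MONTH": ["2020-01", "2020-02", "2020-03", "2020-04"] * 2,
    "ACTIVE": [1, 0, 0, 1, 1, 1, 0, 0],
})
```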
Each row represents a particular customer's activity in a given month. To each row I need to add the most recent MONTH in which that customer was ACTIVE, up to and including that row's MONTH.
Ideally, for the example data above, the DataFrame should look like this:
CUSTOMER_ID MONTH ACTIVE LAST_TIME_ACTIVE
123456 2020-01 1 2020-01
123456 2020-02 0 2020-01
123456 2020-03 0 2020-01
123456 2020-04 1 2020-04
654321 2020-01 1 2020-01
654321 2020-02 1 2020-02
654321 2020-03 0 2020-02
654321 2020-04 0 2020-02
I tried the solution explained at this link, but it gives the overall maximum and does not satisfy the "relative to that row's month" condition.
On top of that, I tried defining a function and calling it on my DataFrame with .apply(), but it is very slow because it filters the whole DataFrame for every row, which is the costliest operation of all.
Here is how the function is defined:
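    import pandas as pd

    def get_last_active_month(dfRow, wholeDF):
        # Active rows for this customer, up to and including this row's month
        mask = (
            (wholeDF['CUSTOMER_ID'] == dfRow['CUSTOMER_ID'])
            & (wholeDF['MONTH'] <= dfRow['MONTH'])
            & (wholeDF['ACTIVE'] == 1)
        )
        # .max() instead of .item(): .item() raises when more than one row matches
        lastActiveMonth = wholeDF.loc[mask, 'MONTH'].max()
        # No active month found: use the fallback value
        return '2017-12' if pd.isna(lastActiveMonth) else lastActiveMonth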
I am working with more than 90,000 customers, and I need to apply this logic to the data from 2018 until today, so there are a lot of rows. Looping is out of the question (I even tried that as an act of desperation; it is incredibly slow and not a Pythonic solution).
I would appreciate help finding a fast, Pythonic solution. Thank you!
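One vectorized direction that might fit, shown only as a sketch on the sample data and not tested at full scale: keep MONTH only on the active rows, then forward-fill within each customer. It assumes the rows can be sorted by MONTH within each CUSTOMER_ID, and it reuses '2017-12' as the fallback for rows with no earlier active month.

```python
import pandas as pd

# Sample data from the question (MONTH assumed to be a 'YYYY-MM' string).
df = pd.DataFrame({
    "CUSTOMER_ID": [123456] * 4 + [654321] * 4,
    "MONTH": ["2020-01", "2020-02", "2020-03", "2020-04"] * 2,
    "ACTIVE": [1, 0, 0, 1, 1, 1, 0, 0],
})

# Assumption: forward-filling only makes sense if rows are in month order
# within each customer, so sort first.
df = df.sort_values(["CUSTOMER_ID", "MONTH"]).reset_index(drop=True)

# Keep MONTH on active rows only; inactive rows become missing values.
active_month = df["MONTH"].where(df["ACTIVE"] == 1)

# Forward-fill within each customer, so an inactive row inherits the
# most recent active month at or before it.
df["LAST_TIME_ACTIVE"] = active_month.groupby(df["CUSTOMER_ID"]).ffill()

# Rows before a customer's first active month are still missing;
# '2017-12' mirrors the fallback used in the function above.
df["LAST_TIME_ACTIVE"] = df["LAST_TIME_ACTIVE"].fillna("2017-12")

print(df)
```

This avoids filtering the whole DataFrame once per row: the sort, mask, and group-wise fill each pass over the data a fixed number of times.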