1

I have a DataFrame that looks like this:

CUSTOMER_ID MONTH       ACTIVE
123456      2020-01     1
123456      2020-02     0
123456      2020-03     0
123456      2020-04     1
654321      2020-01     1
654321      2020-02     1
654321      2020-03     0
654321      2020-04     0

From this data, to each of the rows (which represents particular customer's performance in that month) I need to add the MONTH when was that particular customer last time ACTIVE, relative to that row's MONTH.

Ideally for the example data subset here, DataFrame should look like this:

CUSTOMER_ID MONTH       ACTIVE      LAST_TIME_ACTIVE
123456      2020-01     1               2020-01
123456      2020-02     0               2020-01
123456      2020-03     0               2020-01
123456      2020-04     1               2020-04
654321      2020-01     1               2020-01
654321      2020-02     1               2020-02
654321      2020-03     0               2020-02
654321      2020-04     0               2020-02

I tried the solution explained on this link , but the solution there will give me the general maximum, it doesn't satisfy the "relative to that row's month" condition.

On top of that I tried defining the function and call it from my DataFrame by using .apply(), but it is super slow, because every time filtering the whole DataFrame - and this is the costliest operation of them all.

Here is how the function is defined:

def get_last_active_month(dfRow, wholeDF) :
    
    try:
        lastActiveMonth = wholeDF[(wholeDF['CUSTOMER_ID']==dfRow['CUSTOMER_ID']) & (wholeDF['MONTH']<=dfRow['MONTH']) & (wholeDF['ACTIVE']==1)]['MONTH'].item()
    except:
        lastActiveMonth = '2017-12'
    finally:
        return lastActiveMonth;

I am working with more than 90 000 customers, and I need to apply this logic for the data starting in 2018 and all the way until today, so we are talking about really a lot of rows. Looping, of course, is out of the question (i tried even that as an act of desperation, and of course it is incredibly slow, and non-Pythonic "solution").

I am kindly asking for help in finding Pythonic and fast solution. Thank you!

3MAJ86
  • 15
  • 3

2 Answers2

1

Use pd.Series.where with groupby and ffill:

df["new"] = df["MONTH"].where(df["ACTIVE"].ne(0))

df["new"] = df.groupby("CUSTOMER_ID")["new"].ffill()

print (df)

   CUSTOMER_ID    MONTH  ACTIVE      new
0       123456  2020-01       1  2020-01
1       123456  2020-02       0  2020-01
2       123456  2020-03       0  2020-01
3       123456  2020-04       1  2020-04
4       654321  2020-01       1  2020-01
5       654321  2020-02       1  2020-02
6       654321  2020-03       0  2020-02
7       654321  2020-04       0  2020-02
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
  • Thank you so much! It works on the small data set that I am testing my code on. I will now test it on all the data, and come back and accept the solution. – 3MAJ86 Sep 15 '20 at 12:41
1

Pandas one (obfuscated) liner (assuming using Date type):

df['month_last_active'] = df.groupby([df.CUSTOMER_ID, df.groupby('CUSTOMER_ID').ACTIVE.cumsum()]).MONTH.cummin()
hello_friend
  • 5,682
  • 1
  • 11
  • 15
  • Thanks! Unfortunately the data type is str, because our billing months were not always real-world months, but rather tied to our internal business logic, which made us have more than 20 months even in some years :) I can't test the solution on my data, but I am sure it is useful! – 3MAJ86 Sep 15 '20 at 12:59
  • no worries, pls note you can `df['MONTH'] = pd.to_datetime(df['MONTH'])` coerce to a date prior to applying the one liner (making it a two-liner) – hello_friend Sep 15 '20 at 13:01
  • @3MAJ86 also if you find my solution useful don't forget to upvote ;) ! – hello_friend Sep 15 '20 at 13:02
  • I did upvote, although having in mind that I am a new user here, my upvotes are not visible :( – 3MAJ86 Sep 15 '20 at 13:03