
I want to find a solution for the following problem:

Given a table with user_id and the dates on which each user visited the platform, find the top 100 users with the longest continuous streak of visiting the platform as of yesterday.

I found these links that explain how to do this for one user. However, I am not sure how to do it for multiple users.

One naive approach would be to get all unique users and, using a for loop together with the single-user answer, find each user's maximum continuous visiting streak. However, I am interested in a vectorised way if possible.
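For reference, a minimal sketch of that loop-based baseline (my addition, assuming the df with uid and date_val columns defined below) could look like this:

# Naive baseline: loop over users and reuse the single-user streak trick.
streaks = {}
for uid in df["uid"].unique():
    dates = df.loc[df["uid"] == uid, "date_val"].sort_values()
    # a new streak starts whenever a date is not exactly one day after the previous one
    groups = (dates != dates.shift() + pd.Timedelta(days=1)).cumsum()
    streaks[uid] = groups.value_counts().max()
top_100 = pd.Series(streaks).sort_values(ascending=False).iloc[:100]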

If needed, this is the code I used to generate sample data:

import numpy as np
import pandas as pd
date_series = pd.Series(np.random.randint(0, 10, 400), index=pd.to_datetime(np.random.randint(0, 20, 400) * 1e9 * 24 * 3600), name="uid")  # random uids 0-9 over ~20 days of dates
df = date_series.reset_index().rename({"index": "date_val"}, axis=1).drop_duplicates().reset_index(drop=True)

For a given user id (say uid=1), I can use the following to find the max streak:

sub_df = df[df.uid == 1].sort_values("date_val")
(sub_df.date_val != sub_df.date_val.shift() + pd.Timedelta(days=1)).cumsum().value_counts().max()  # label each run of consecutive days, then take the longest run
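As a quick sanity check (my addition, with made-up dates), the grouping trick can be verified on a small hand-built series where the expected answer is 3:

dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-05", "2022-01-06"]))
groups = (dates != dates.shift() + pd.Timedelta(days=1)).cumsum()  # labels: 1, 1, 1, 2, 2
print(groups.value_counts().max())  # 3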

But I don't understand how to do a similar thing for all users in the original dataframe (df) with a vectorized (not for loop) approach.

Wenuka

2 Answers


I have gone the long way round; there may be a shorter way out there. Let's try:

df = df.sort_values(by=['uid', 'date_val'])  # sort df

# check sequence
df = df.assign(diff=df['date_val'].diff().dt.days,
               diff1=df['date_val'].diff(-1).dt.days)

# create a grouper
s = (((df['diff'].isna()) & (df['diff1'] == -1)) | ((df['diff'].gt(1)) & (df['diff1'] == -1))).cumsum()

# get streak length
df['streak'] = df.groupby([s, 'uid'])['date_val'].transform('count')

# isolate max streak
new = df[df['streak'] == df.groupby('uid')['streak'].transform('max')].drop(columns=['diff', 'diff1']).sort_values(by=['uid', 'date_val'])
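To get from new to the top 100 users the question asks for, one possible follow-up step (my addition, not part of the original answer) would be:

# new carries each user's max streak on every remaining row, so one row per user is enough
top_100 = (new.drop_duplicates("uid")
              .sort_values("streak", ascending=False)
              .loc[:, ["uid", "streak"]]
              .head(100))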
wwnde
  • Thanks for your answer, but I think it has a small flaw when the next user's (say uid=2) first date is uid=1's last day plus 1. With your help, I found a better answer and hope it's okay to add it below. – Wenuka Jan 23 '22 at 11:23
  • Go for it man. You understand the problem more than I do. Mine was an attempt to help. Happy coding :-) – wwnde Jan 23 '22 at 11:24
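To make the flaw from the comment concrete, here is a small made-up frame (my addition): uid=2's first visit is one day after uid=1's last visit and is followed by an isolated visit, and the grouper above then reports a streak of 2 for uid=2 even though the two visits are not consecutive:

demo = pd.DataFrame({
    "uid": [1, 1, 2, 2],
    "date_val": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-05"]),
}).sort_values(["uid", "date_val"])
demo = demo.assign(diff=demo["date_val"].diff().dt.days,
                   diff1=demo["date_val"].diff(-1).dt.days)
s = (((demo["diff"].isna()) & (demo["diff1"] == -1))
     | ((demo["diff"].gt(1)) & (demo["diff1"] == -1))).cumsum()
print(demo.groupby([s, "uid"])["date_val"].count())  # uid=2 is reported with 2, true max streak is 1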

With the help of @wwnde's answer, I found the answer below. Posting it in case someone finds it useful.

df.sort_values(["uid", "date_val"], inplace=True)  # sort the df

df["diff1"] = df.date_val.diff().dt.days  # new column to store the date diff

df["user_segments"] = ((df.diff1 != 1)|(df.uid != df.uid.shift(-1))).cumsum()  # to create groups of consecutive days

df.groupby(["uid", "user_segments"])["date_val"].count().reset_index()\ # now date_val column have consecutive day counts
.groupby("uid")["date_val"].max()\  # then group by and get the max for each user
.sort_values(ascending=False).iloc[:100]  # finally sort it and get the first 100 users
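For convenience, the same steps can be wrapped in a small helper (the function name longest_streaks is mine, not from the original answer):

def longest_streaks(df, top_n=100):
    df = df.sort_values(["uid", "date_val"]).copy()
    new_segment = (df["date_val"].diff().dt.days != 1) | (df["uid"] != df["uid"].shift())
    df["user_segments"] = new_segment.cumsum()
    return (df.groupby(["uid", "user_segments"])["date_val"].count()
              .groupby("uid").max()
              .sort_values(ascending=False)
              .head(top_n))

print(longest_streaks(df).head())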
Wenuka