As KppatelPatel suggested, I like lubridate
for this. Libraries and data:
library(lubridate)
library(dplyr)
registry <- read.table(text = 'userId cohortId funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)
workouts <- read.table(text = 'userId cohortId funnelStep week_number
8991eb20 2017-10-23 completeWorkout 4
34ed55c1 2017-10-18 completeWorkout 4', header = TRUE)
Your data may already have your weeks as a date object (read.table
reads the example dates as factors by defaults)- if not:
registry$cohortId <- as.Date(registry$cohortId)
workouts$cohortId <- as.Date(workouts$cohortId)
Make a function to check if the time between dates is greater than a month. The function steps are:
- Join the registry and workouts tables together, based on the user id column
- Make a new column called 'active.1st.month' and assign to that column the result of the logical test that checks if the difference between registry cohort id and workouts cohort id is greater than one month
- build a dataframe containing only the columns
userId
, cohortId.x
, cohortId.y
, and active.1st.month
, and rename them to be more descriptive
- return that nicely-named dataframe
check_activity <- function(reg.df, work.df){
reg.work <- inner_join(registry, workouts, by = "userId")
reg.work$active.1st.month <-
(reg.work$cohortId.y - reg.work$cohortId.x) < as.duration(months(1))
reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.1st.month")]
names(reg.work) <- c("user", "registered", "workout", "active.1st.month")
return(reg.work)
}
> check_activity(registry, workouts)
user registered workout active.1st.month
1 8991eb20 2017-10-23 2017-10-23 TRUE
2 34ed55c1 2017-08-24 2017-10-18 FALSE
Of course, you can change months(1)
to be whatever length of time you prefer (eg, weeks(4)
).
EDIT:
As per your comment, I thought it might be easiest to just return the month in which the user was first active (had their first completed workout). With this new fake data:
registry <- read.table(text = 'userId cohortId funnelStep
8991eb20 2017-10-23 registration
example1 2017-10-23 registration
example2 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)
workouts <- read.table(text = 'userId cohortId funnelStep week_number
8991eb20 2017-10-23 completeWorkout 4
example1 2017-10-28 completeWorkout 4
example2 2017-11-28 completeWorkout 4
34ed55c1 2017-12-18 completeWorkout 4', header = TRUE)
Now change the function slightly so it outputs the number of the month in which a given user was first active (had a completed workout).
check_active_month <- function(reg.df, work.df){
reg.work <- inner_join(registry, workouts, by = "userId")
reg.work$active.month <-
1 + floor(as.duration(workouts$cohortId - registry$cohortId) / as.duration(months(1)))
reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.month")]
names(reg.work) <- c("user", "registered", "workout", "active.month")
return(reg.work)
}
Now you can summarize the length of users with a given active.month
to output the number of users who are active in their first month after registration, second month, etc:
check_active_month(registry, workouts)
active.months.df %>%
group_by(active.month) %>%
summarise(n.users.active.month = length(active.month))
# A tibble: 3 x 2
active.month n.users.active.month
<dbl> <int>
1 1 2
2 2 1
3 4 2