Difference between dates in different columns in r

Question

I'm new to R-programming and I wanted to calculate if a user was active in the month he registered. Therefore I have two tables- one called workouts and the other registrations. Users are classified by cohorts in the CohortId column. What I wanted to do know was to calculate the difference between the cohortId- dates of registrations and workouts to see if a user is active in the month of their first registration.

This is what I have so far:

week_difference <- function(end_date, start_date){
    as.integer(difftime(head(strptime(end_date, format = "%Y-%m-%d"), 1),
               tail(strptime(start_date, format = "%Y-%m-%d"),1), units = "weeks"),0)
}


retention_week <- funnel_workout %>% group_by(userId) %>%  select(userId, cohortId) %>% 
  mutate(week_number = if(!is.na(cohortId)){week_difference(funnel_registration$cohortId, funnel_workout$cohortId)}else{print(NA)})

The problem is that week_number is always 4 and doesn't really calculate the difference between the dates.

Thanks in advance for any kind of help!

EDIT:

Here is the registrations df:

userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration

and the workouts df:

userId   cohortId      funnelStep week_number
8991eb20 2017-10-23 completeWorkout           4
34ed55c1 2017-10-18 completeWorkout           4

Can you provide an [example of each data frame?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example-aka-mcve-minimal-complete-and-ver) — Luke C, Nov 24 '17 at 23:23
@LukeC here it is :) I really feel like R and the necessary packages make everything more complicated than they are right now... — heythere, Nov 24 '17 at 23:50
Right on, thanks. So is `week_number` in the `workouts` data frame the output from your function? — Luke C, Nov 24 '17 at 23:53
look in lubridate package, it will make life easier if you are working with date — Kush Patel, Nov 25 '17 at 00:07
@KppatelPatel I've heard of lubridate and have been using it at some other places but I don't know how it would help me with my problem. the week_difference function is right when I type in some dummy dates but I don't know why it always prints 4 when trying calculate the diff of two columns. — heythere, Nov 25 '17 at 00:31

Luke C · Accepted Answer · 2017-11-25T19:54:36.793

As KppatelPatel suggested, I like lubridate for this. Libraries and data:

library(lubridate)
library(dplyr)

registry <- read.table(text = 'userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)

workouts <- read.table(text = 'userId   cohortId      funnelStep week_number
8991eb20 2017-10-23 completeWorkout           4
34ed55c1 2017-10-18 completeWorkout           4', header = TRUE)

Your data may already have your weeks as a date object (read.table reads the example dates as factors by defaults)- if not:

registry$cohortId <- as.Date(registry$cohortId)
workouts$cohortId <- as.Date(workouts$cohortId)

Make a function to check if the time between dates is greater than a month. The function steps are:

Join the registry and workouts tables together, based on the user id column
Make a new column called 'active.1st.month' and assign to that column the result of the logical test that checks if the difference between registry cohort id and workouts cohort id is greater than one month
build a dataframe containing only the columns userId, cohortId.x, cohortId.y, and active.1st.month, and rename them to be more descriptive
return that nicely-named dataframe

check_activity <- function(reg.df, work.df){
  reg.work <- inner_join(registry, workouts, by = "userId")
  reg.work$active.1st.month <- 
    (reg.work$cohortId.y - reg.work$cohortId.x) < as.duration(months(1))
  reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.1st.month")]
  names(reg.work) <- c("user", "registered", "workout", "active.1st.month")
  return(reg.work)
}

> check_activity(registry, workouts)
      user registered    workout active.1st.month
1 8991eb20 2017-10-23 2017-10-23             TRUE
2 34ed55c1 2017-08-24 2017-10-18            FALSE

Of course, you can change months(1) to be whatever length of time you prefer (eg, weeks(4)).

EDIT:

As per your comment, I thought it might be easiest to just return the month in which the user was first active (had their first completed workout). With this new fake data:

registry <- read.table(text = 'userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
example1 2017-10-23 registration
example2 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)

workouts <- read.table(text = 'userId   cohortId      funnelStep week_number
8991eb20 2017-10-23 completeWorkout           4
example1 2017-10-28 completeWorkout           4
example2 2017-11-28 completeWorkout           4
34ed55c1 2017-12-18 completeWorkout           4', header = TRUE)

Now change the function slightly so it outputs the number of the month in which a given user was first active (had a completed workout).

check_active_month <- function(reg.df, work.df){
  reg.work <- inner_join(registry, workouts, by = "userId")
  reg.work$active.month <- 
    1 + floor(as.duration(workouts$cohortId - registry$cohortId) / as.duration(months(1)))
  reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.month")]
  names(reg.work) <- c("user", "registered", "workout", "active.month")
  return(reg.work)
}

Now you can summarize the length of users with a given active.month to output the number of users who are active in their first month after registration, second month, etc:

check_active_month(registry, workouts)
active.months.df %>% 
  group_by(active.month) %>%
  summarise(n.users.active.month = length(active.month))

# A tibble: 3 x 2
  active.month n.users.active.month
         <dbl>                <int>
1            1                    2
2            2                    1
3            4                    2

wow, thank you thats great. The only thing is that I am trying to calculate the amount of active users in 1st.month and 2nd.month etc. I will try it and update you. — heythere, Nov 25 '17 at 13:55
@heythere - Ah ok, I have modified my answer to output what I think you're after. — Luke C, Nov 25 '17 at 19:55
@heythere if the answer has resolved your issue, kindly *accept* it (answers take up valuable time for respondents) — desertnaut, Dec 03 '17 at 18:49
@LukeC thank you very much for your help. definitely learned to work more efficiently with the packages from your answer. :) — heythere, Dec 03 '17 at 22:55

Difference between dates in different columns in r

1 Answers1