1

The above code calculates the cohort-wise retention rates. The cohort is the month of joining. So, the code caluclates the number customers who joined in the month of May 2015, how many are active month on month. The final output is stored in data frame df1 (shown below)

I need help creating dynamic column names which are currently hard coded in the ddply function. M0 which means month of joining, M1 means 1st month from joining, M2 means 2 month from joining to M(n) should be variables. This can calculated by subtracting the farthest expiry date from the earliest joining date.

Unfortunately, I'm not able to auto calculate M0 to M(n) range dynamically.

Here is my code dump which works but is not optimal because I have hardcoded M0 to M3 as a variable in ddply function. So if my input data has a customer with a longer subscription period than 5 months my code will fail.

The input to the code is the following dummy data.

customer    dj       exp
abc      01/05/15   25/06/15
efg      01/05/15   25/07/15
ghd      01/05/15   25/07/15
mkd      01/06/15   25/07/15
kskm     01/06/15   05/08/15

Reproducible code.

    library(zoo)
    library(plyr)

    customer<-c("abc","efg","ghd","mkd","kskm")
    dj<-c("2015-05-01", "2015-05-01", "2015-05-01","2015-06-01","2015-06-01")
    exp<-c("2015-06-25", "2015-07-25", "2015-07-25","2015-07-01","2015-08-05")
    data.frame(customer,dj,exp)
    df$dj <- as.Date(df$dj,"%d/%m/%y")
    df$exp <- as.Date(df$exp,"%d/%m/%y")

    # The data in the file has different variable names than your example data
    # so I'm changing them to match
    names(df)[1:3] <- c("customer","dj","exp")

    # Make a variable called Cohort that contains only the year and month of joining
    # as.yearmon() comes from the 'zoo' package
    df$Cohort <- as.yearmon(df$dj)

    # Calculate the difference in months between date of expiry and date of joining
    df$MonthDiff <- ceiling((df$exp-df$dj)/30)
    #df$MonthDiff <- 12*(as.yearmon(df$exp+months(1))-df$Cohort)

    range<-as.integer(ceiling((max(df$exp)-min(df$dj)))/30)

    # Use ddply() from the 'plyr' package to get the frequency of subjects that are
    # still active after 0, 1, 2, and 3 months.

    df1 <- ddply(df,.(Cohort),summarize,
                 M0 = sum(MonthDiff > 0), 
                 M1 = sum(MonthDiff > 1),
                 M2 = sum(MonthDiff > 2),
                 M3 = sum(MonthDiff > 3)

    )

 df1


df1
    Cohort M0 M1 M2 M3 
1 May 2015  3  3  2  0  
2 Jun 2015  2  2  1  0 

The above is the output working output. Ask is to make column M0 to M3 dynamic

Vaibhav Jha
  • 99
  • 2
  • 10
  • What about this? http://stackoverflow.com/questions/3784187/how-to-use-string-variables-to-create-variables-list-for-ddply – Roman Luštrik Dec 02 '15 at 11:56
  • @RomanLuštrik That question is about using a character vector for pass the grouping variables to `ddply`. This question is about using names of columns in `summarise`. I believe it to be a duplicate of http://stackoverflow.com/questions/16319509/passing-parameters-to-r-function-used-plyr – Jan van der Laan Dec 02 '15 at 12:30
  • @JanvanderLaan could be, but at this point, I think I would like to see a reproducible example. Hopefully OP will hear me. :) – Roman Luštrik Dec 02 '15 at 13:31
  • @RomanLuštrik din't understand. Could you help? – Vaibhav Jha Dec 02 '15 at 13:42
  • @JanvanderLaan Guys I'm new to R. Please help. Went through the posts but I understood that summerise should be used. Then whats the alternative? – Vaibhav Jha Dec 02 '15 at 13:47
  • 3
    Perhaps if you provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) we can offer some more assistance? – Roman Luštrik Dec 02 '15 at 13:58
  • df1 <- ddply(df,.(Cohort),summarize, M0 = sum(MonthDiff > 0), M1 = sum(MonthDiff > 1), M2 = sum(MonthDiff > 2), M3 = sum(MonthDiff > 3), M4 = sum(MonthDiff > 4), M5 = sum(MonthDiff > 5) ) I would like M0 to M4 to be variables. These are hardcoded right now. – Vaibhav Jha Dec 02 '15 at 14:10
  • @RomanLuštrik Please let me know if you understood. 1. I have csv input file with customer name and his start and expiry date 2. In the code above I check how many customers are active month on month. 3. Right now I have hard coded the value M0 month he joined, M1 first month from joining in ddply function 4. I would like M0 to M5 to be dynamicaly calculated based on expiry - start date. – Vaibhav Jha Dec 02 '15 at 14:20
  • 1
    Please incorporate anything that is relevant to the question into the question itself. Especially the code. Please follow the link I provided above to learn how to provide code _and_ data to reproduce what you're after. – Roman Luštrik Dec 02 '15 at 14:23
  • @RomanLuštrik I have made edits so that you can understand better. Please let me know. – Vaibhav Jha Dec 02 '15 at 14:37
  • _Please_ read the link about reproducible example. Your question should be formulated so that one can easily copy code and data into own R session. – Roman Luštrik Dec 02 '15 at 14:48
  • @RomanLuštrik you can now copy the code. Added the data frame that I was earlier inputting from csv to code. Thanks. Look forward to your solution. – Vaibhav Jha Dec 02 '15 at 15:10

1 Answers1

0

Try inserting this after creating range:

for(i in 0:range) df <- within(df,assign(paste0("M",i),MonthDiff>i))

df1 <- ddply(df,.(Cohort),function(x) colSums(x[,paste0("M",0:range)]))

df1
#     Cohort M0 M1 M2 M3
# 1 May 2015  3  3  2  0
# 2 Jun 2015  2  1  1  0
Sam Dickson
  • 5,082
  • 1
  • 27
  • 45