7

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.

> test
   user_id period
1        1      1
2        5      1
3        1      1
4        3      1
5        4      1
6        2      2
7        3      2
8        2      2
9        3      2
10       1      2
11       5      3
12       5      3
13       2      3
14       1      3
15       4      3
16       5      4
17       5      4
18       5      4
19       4      4
20       3      4

For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.

The output would be the following:

> output
  period retention
1      1        NA
2      2     0.500
3      3     0.666
4      4     0.500

The test data:

> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
iskandarblue
  • 7,208
  • 15
  • 60
  • 130

3 Answers3

4

How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply.

splt <- split(test$user_id, test$period)

carryover <- function(x, y) {
    length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])

        1         2         3 
0.5000000 0.6666667 0.5000000 
Daniel Anderson
  • 2,394
  • 13
  • 26
4

Here is an attempt using dplyr, though it also uses some standard syntax in the summarise:

test %>% 
group_by(period) %>% 
summarise(retention=length(intersect(user_id,test$user_id[test$period==(period+1)]))/n_distinct(user_id)) %>% 
mutate(retention=lag(retention))

This returns:

period retention
   <dbl>     <dbl>
1      1        NA
2      2 0.5000000
3      3 0.6666667
4      4 0.5000000
Lamia
  • 3,845
  • 1
  • 12
  • 19
  • Wouldn't it be better if you calculate length of unique user_ids? I think that if a user appears twice in a period it should still count as one returning user. – Giacomo Jul 24 '17 at 15:17
  • @giac_man By default, `intersect` removes any duplicates, so there's no need to add `unique` as the operation is already included in it. – Lamia Jul 25 '17 at 12:04
  • Ah, I see. Thanks – Giacomo Jul 25 '17 at 12:46
  • can I also ask you why you have used period+1 and then apply function lag. Couldn't you do period -1 instead? – Giacomo Jul 25 '17 at 12:48
  • @giac_man The retention rate for period 2 is the number of unique users present in both period 1 and 2, divided by the number of unique users in period 1. If you do what you describe, the numerator would be correct but you'd be dividing by the number of unique users in period 2 instead of period 1, which doesn't give you the desired result. – Lamia Jul 27 '17 at 18:06
0

This isn't so elegant but it seems to work. Assuming df is the data frame:

# make a list to hold unique IDS by 
uniques = list()
for(i in 1:max(df$period)){
  uniques[[i]] = unique(df$user_id[df$period == i])
}

# hold the retention rates
retentions = rep(NA, times = max(df$period))

for(j in 2:max(df$period)){
  retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}

Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.

svenhalvorson
  • 1,090
  • 8
  • 21