1

I'm working with a large electronic database which records user driven events. Essentially, I want to obtain the proportion/percentages of new users to the service on a monthly basis for an entire year. Below is just a mock example of the data:-

    UserId    Month   UserEventId

    Tyrhjj01   Jan     0998907
    Fghhey21   Jan     0989892
    Hyhkio52   Jan     7782901
    hejdoe78   Jan     3889201
    Tyrhjj01   Feb     7829930
    sjjwilsn   Feb     7728910
    Tyrhjj01   Feb     9203749
    nnkilo89   Feb     7728912
    Fghhey21   Feb     4463782

...And so on. As you can see, there are clients who use the service regularly, while some clients are unique in the month of Feb. I want to get a percentage of clients who are old and clients who are unique to the system.I'm attaching an illustration, to help for better understanding.

Percentage of new vs old clients:

img.

I have tried a couple of examples from dplyr and data.table but to no avail. Any help would be greatly appreciated!

camille
  • 16,432
  • 18
  • 38
  • 60
metaltoaster
  • 380
  • 2
  • 15
  • 2
    What have you tried so far, and what code did you write to get that plot? How exactly are you defining old vs new? – camille Oct 05 '18 at 16:26
  • @camille Okay so firstly I tried (forgive me, still relatively new to R): setDT(df)[,.(count=uniqueN(UserId)), by=Month] Didn't have the desired outcome. Then I tried with dplyr:- df %>% group_by(Month) %>% summarise(n = n_distinct(UserId)) Thinking that I may be able to calculate by hand the proportion of new and old clients by hand. But still not desired outcome. – metaltoaster Oct 05 '18 at 16:29
  • 1
    Please share a sample of your data using `dput()` (not `str` or `head` or picture/screenshot) and the code you're working on so others can help. See more here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?rq=1 – Tung Oct 05 '18 at 16:30
  • @camille Old Clients are clients in the current month who have used the service before, new clients are defined as those clients in the current month who have not used the service before. Hope this clarifies things :) – metaltoaster Oct 05 '18 at 16:37
  • @Tung thank you. The picture is just there to help illustrate to others what I am trying to achieve. – metaltoaster Oct 05 '18 at 16:39
  • It's easier to follow if you edit the question to include more specifics and the code you're working on – camille Oct 05 '18 at 16:48
  • Are you asking for a graphing or calculation? Please Edit post (see link below question). – Parfait Oct 05 '18 at 16:57

1 Answers1

1

If you create a new dataset with unique users for each month, you can then use rowid from data.table to see whether they are present in df during previous months.

library(data.table)
setDT(df)

users <- df[, .(user = unique(UserId)), Month] 
users[, visit := rowid(user)] # create variable for number of months user has visited
users[, .(new_pct = mean(visit == 1)), Month] 

#    Month new_pct
# 1:   Jan     1.0
# 2:   Feb     0.5

Or with tidyverse

Edit: The solution below doesn't work if your Month column is actually character month names. As seen below, dplyr grouping reorders your data (unlike data.table), and so produces incorrect results with this method. I'll leave the code below since it works if Month is a date class column.

df %>% 
  group_by(Month) %>% 
  do(user = unique(.$UserId)) %>% 
  unnest %>% 
  group_by(user) %>% 
  mutate(visit = row_number()) %>% 
  group_by(Month) %>% 
  summarise(new_pct = mean(visit == 1))

# # A tibble: 2 x 2
#   Month new_pct
#   <chr>   <dbl>
# 1 Feb     1.00 
# 2 Jan     0.500

Data used:

df <- fread("
UserId    Month   UserEventId
Tyrhjj01   Jan     0998907
Fghhey21   Jan     0989892
Hyhkio52   Jan     7782901
hejdoe78   Jan     3889201
Tyrhjj01   Feb     7829930
sjjwilsn   Feb     7728910
Tyrhjj01   Feb     9203749
nnkilo89   Feb     7728912
Fghhey21   Feb     4463782
")
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38