count of new and returning customers in R

Question

I have the following df in R:

| date_entered | person_id   | 
| ------------ | ----------- |
| 2021-06-01   | 12300001    |
| 2021-06-01   | 12300002    |
| 2021-06-01   | 12300003    |
| 2021-06-02   | 12300004    |
| 2021-06-02   | 12300001    |
| 2021-06-02   | 12300010    |
| 2021-06-02   | 12300011    |
| 2021-06-03   | 12301000    |
| 2021-06-03   | 12300002    |

I would like to count the number of new and returning person_id each day.

I tried creating a column identifying the first date the person_id first get entered:

merged_qa <- df %>%
  group_by(person_id) %>%
  mutate(date_of_first_entered=min(date_entered)) %>%
  ungroup()

but the code produces a column with the value of the earliest date in the date_entered without the group_by/regardless of the person_id:

output:

| date_entered | person_id   | date_first_entered |
| ------------ | ----------- | ------------------ |
| 2021-06-01   | 12300001    | 2021-06-01         |
| 2021-06-01   | 12300002    | 2021-06-01         |
| 2021-06-01   | 12300003    | 2021-06-01         |
| 2021-06-02   | 12300004    | 2021-06-01         |
| 2021-06-02   | 12300001    | 2021-06-01         |
| 2021-06-02   | 12300010    | 2021-06-01         |
| 2021-06-02   | 12300011    | 2021-06-01         |
| 2021-06-03   | 12301000    | 2021-06-01         |
| 2021-06-03   | 12300002    | 2021-06-01         |

expected:

date_entered	person_id	date_first_entered
2021-06-01	12300001	2021-06-01
2021-06-01	12300002	2021-06-01
2021-06-01	12300003	2021-06-01
2021-06-02	12300004	2021-06-02
2021-06-02	12300001	2021-06-01
2021-06-02	12300010	2021-06-02
2021-06-02	12300011	2021-06-02
2021-06-03	12301000	2021-06-03
2021-06-03	12300002	2021-06-01

if I can get the expected table, I can therefore label if it is a new or prior customer with the following code (i think):

merged_qa <- merged_qa %>%
  mutate(person_status = case_when
         (date_entered > date_of_first_entered ~ "prior",
         date_entered == date_of_first_entered ~ "new",
         TRUE ~ "Other"))

where in the end, I can create a new df with the following code:

new_and_prior <- merged_qa %>%
  group_by(floor_date(date_entered,unit = 'days')) %>%
  summarise(eod_count=
              n_distinct(person_id),
            prior = 
              n_distinct(person_id[person_status=='prior']),
            new_person =
              n_distinct(person_id[person_status=='new'])
            )

date_entered	eod_count	prior	new
2021-06-01	3	0	3
2021-06-02	4	1	3
2021-06-03	2	1	1

I need help troubleshooting the group_by code above that will created the date of the first time a person_id is entered in the df.

OR

Can you suggest a better way to code this?

When I try using `df2 %>% group_by(person_id) %>% mutate(new_date = min(date_entered)) %>% ungroup()` I get your expected output... — Martin Gal, Jul 03 '21 at 20:59
The issue would be related to loading `plyr` which masks the mutate from dplyr with plyr::mutate. Try explicitly `dplyr::mutate(date_of_first_entered=min(date_entered))` — akrun, Jul 03 '21 at 21:13
https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with — Ronak Shah, Jul 04 '21 at 02:30

score 3 · Accepted Answer · answered Jul 03 '21 at 21:14

Here is a solution using base R. This skips the problem of identifying the first arrival date for each customer and jumps straight to your bottom table.
See comments for a step-by-step instructions.

df<- structure(list(date_entered = c("2021-06-01", "2021-06-01", "2021-06-01",  "2021-06-02", "2021-06-02",
                                     "2021-06-02", "2021-06-02", "2021-06-03",  "2021-06-03"), 
                    person_id = c(12300001L, 12300002L, 12300003L, 12300004L, 12300001L, 12300010L, 12300011L, 
                                  12301000L, 12300002L)), class = "data.frame", row.names = c(NA, -9L))

#add a column identifying duplicates
df$returning <- duplicated(df$person_id)

#create a table if the Returning and new vs data
answer <- (table(df$date_entered, df$returning) )
#add the row and column sums
answer <- addmargins(answer)

#convert to data.frame
as.data.frame.matrix(answer) 

            FALSE TRUE Sum
 2021-06-01     3    0   3
 2021-06-02     3    1   4
 2021-06-03     1    1   2
 Sum            7    2   9

One will need to change the column names from False/True to prior/new

akrun · Answer 2 · 2021-07-03T21:23:33.610

We could do this in a single pipe. Grouped by 'person_id', create the 'person_status' column with case_when, then change the grouping to 'date_entered' and summarise by getting the total number of rows (n()), sum of logical vector for 'prior' and 'new'

library(dplyr)
df %>%
    group_by(person_id) %>% 
    mutate(person_status = case_when(date_entered == min(date_entered) ~ "new",
        date_entered > min(date_entered) ~ "prior", TRUE ~ "Other")) %>% 
    group_by(date_entered) %>%
    summarise(eod_count = n(), 
       prior = sum(person_status == 'prior'), 
         new = sum(person_status == 'new'))

-output

# A tibble: 3 x 4
  date_entered eod_count prior   new
  <date>           <int> <int> <int>
1 2021-06-01           3     0     3
2 2021-06-02           4     1     3
3 2021-06-03           2     1     1

Regarding the OP's issue with mutate returning single value, it can be replicated if we use plyr::mutate

 df %>% 
    group_by(person_id) %>%
    plyr::mutate(date_of_first_entered=min(date_entered))
# A tibble: 9 x 3
# Groups:   person_id [7]
  date_entered person_id date_of_first_entered
  <date>           <int> <date>               
1 2021-06-01    12300001 2021-06-01           
2 2021-06-01    12300002 2021-06-01           
3 2021-06-01    12300003 2021-06-01           
4 2021-06-02    12300004 2021-06-01           
5 2021-06-02    12300001 2021-06-01           
6 2021-06-02    12300010 2021-06-01           
7 2021-06-02    12300011 2021-06-01           
8 2021-06-03    12301000 2021-06-01           
9 2021-06-03    12300002 2021-06-01

This relates to masking of dplyr::mutate with plyr::mutate when both of the packages are loaded. Can rectify it with either explicit specification of dplyr::mutate instead of just mutate or do this on a fresh R session with only dplyr loaded or another option is to assign the dplyr::mutate to a new object and call that

dmutate <- dplyr::mutate
df %>% 
    group_by(person_id) %>%
    dmutate(date_of_first_entered=min(date_entered))
# A tibble: 9 x 3
# Groups:   person_id [7]
  date_entered person_id date_of_first_entered
  <date>           <int> <date>               
1 2021-06-01    12300001 2021-06-01           
2 2021-06-01    12300002 2021-06-01           
3 2021-06-01    12300003 2021-06-01           
4 2021-06-02    12300004 2021-06-02           
5 2021-06-02    12300001 2021-06-01           
6 2021-06-02    12300010 2021-06-02           
7 2021-06-02    12300011 2021-06-02           
8 2021-06-03    12301000 2021-06-03           
9 2021-06-03    12300002 2021-06-01

data

df <- structure(list(date_entered = structure(c(18779, 18779, 18779, 
18780, 18780, 18780, 18780, 18781, 18781), class = "Date"), 
person_id = c(12300001L, 
12300002L, 12300003L, 12300004L, 12300001L, 12300010L, 12300011L, 
12301000L, 12300002L)), row.names = c(NA, -9L), class = "data.frame")

Hi, I used your single pipe, but I am getting this error "Error: `n()` must only be used inside dplyr verbs." When I put dplyr::summarise all entries were labeled prior and therefore creates 0 news. How do I go about this? Thank you — doubleD, Jul 29 '21 at 00:55
@DennisDavid it works fine for me. Probably you have an older package version. I used `packageVersion('dplyr')# [1] ‘1.0.7’` — akrun, Jul 29 '21 at 00:56

count of new and returning customers in R

2 Answers2

data