I have the following df in R:
| date_entered | person_id |
| ------------ | ----------- |
| 2021-06-01 | 12300001 |
| 2021-06-01 | 12300002 |
| 2021-06-01 | 12300003 |
| 2021-06-02 | 12300004 |
| 2021-06-02 | 12300001 |
| 2021-06-02 | 12300010 |
| 2021-06-02 | 12300011 |
| 2021-06-03 | 12301000 |
| 2021-06-03 | 12300002 |
I would like to count the number of new and returning person_id each day.
I tried creating a column identifying the first date the person_id first get entered:
merged_qa <- df %>%
group_by(person_id) %>%
mutate(date_of_first_entered=min(date_entered)) %>%
ungroup()
but the code produces a column with the value of the earliest date in the date_entered without the group_by/regardless of the person_id:
output:
| date_entered | person_id | date_first_entered |
| ------------ | ----------- | ------------------ |
| 2021-06-01 | 12300001 | 2021-06-01 |
| 2021-06-01 | 12300002 | 2021-06-01 |
| 2021-06-01 | 12300003 | 2021-06-01 |
| 2021-06-02 | 12300004 | 2021-06-01 |
| 2021-06-02 | 12300001 | 2021-06-01 |
| 2021-06-02 | 12300010 | 2021-06-01 |
| 2021-06-02 | 12300011 | 2021-06-01 |
| 2021-06-03 | 12301000 | 2021-06-01 |
| 2021-06-03 | 12300002 | 2021-06-01 |
expected:
date_entered | person_id | date_first_entered |
---|---|---|
2021-06-01 | 12300001 | 2021-06-01 |
2021-06-01 | 12300002 | 2021-06-01 |
2021-06-01 | 12300003 | 2021-06-01 |
2021-06-02 | 12300004 | 2021-06-02 |
2021-06-02 | 12300001 | 2021-06-01 |
2021-06-02 | 12300010 | 2021-06-02 |
2021-06-02 | 12300011 | 2021-06-02 |
2021-06-03 | 12301000 | 2021-06-03 |
2021-06-03 | 12300002 | 2021-06-01 |
if I can get the expected table, I can therefore label if it is a new or prior customer with the following code (i think):
merged_qa <- merged_qa %>%
mutate(person_status = case_when
(date_entered > date_of_first_entered ~ "prior",
date_entered == date_of_first_entered ~ "new",
TRUE ~ "Other"))
where in the end, I can create a new df with the following code:
new_and_prior <- merged_qa %>%
group_by(floor_date(date_entered,unit = 'days')) %>%
summarise(eod_count=
n_distinct(person_id),
prior =
n_distinct(person_id[person_status=='prior']),
new_person =
n_distinct(person_id[person_status=='new'])
)
date_entered | eod_count | prior | new |
---|---|---|---|
2021-06-01 | 3 | 0 | 3 |
2021-06-02 | 4 | 1 | 3 |
2021-06-03 | 2 | 1 | 1 |
I need help troubleshooting the group_by code above that will created the date of the first time a person_id is entered in the df.
OR
Can you suggest a better way to code this?