0

I have a dataset that represents patients visits to various doctors in a certain practice throughout a year.

Example-

doctor    patient_no  datetime

dr.kahn   1561        1/21/19 10:30:00
dr.gould  1397        2/06/19 12:30:00
dr.amoor  1596        2/11/19 9:00:00
dr.gould  995         10/07/19 12:30:00
dr.kahn   1561        10/14/19 9:30

I'm trying to create a new dataframe where each row is a unique doctor-patient pairing and shows the numbers of times that patient visited that doctor, along with the average time surpassed between visits for that particular patient-doctor pairing. So for instance if patient A went to dr.kahn 4 times in a year, what was the average amount of time in between patient A's consecutive appointments to dr.kahn.

Example-

doctor   patient_no   number_of_visits  avg_time_passed_between_appointments

dr.gould   1054       7                 2 months 1 days  2:00:00
dr.gould   1099       2                 5 months 10 days 00:00:00
dr.kahn    875        12                0 months 26 days 0:30:00

Any help would be appreciated. Thanks!

neilfws
  • 32,751
  • 5
  • 50
  • 63
aport550
  • 119
  • 1
  • 9
  • Tags do not belong in the question title unless they are part of a proper sentence. The code tag is enough. It's easier to help you if you share your data in a [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that can be copy/pasted into R. Ideally your sample output should match the sample input. What happens when they have only one visit? – MrFlick Feb 19 '20 at 04:44

1 Answers1

0

Here's a dplyr approach:

library(tidyverse)
df %>%
  mutate(datetime = lubridate::mdy_hm(datetime)) %>%
  group_by(doctor, patient_no) %>%
  summarize(count = n(),
            avg_days_between = (max(datetime) - min(datetime)) / lubridate::ddays(count - 1)) %>%
  ungroup()

## A tibble: 4 x 4
#  doctor   patient_no count avg_days_between
#  <chr>         <dbl> <int>            <dbl>
#1 dr.amoor       1596     1             NaN 
#2 dr.gould        995     1             NaN 
#3 dr.gould       1397     1             NaN 
#4 dr.kahn        1561     2             266.

Or you could calculate each lag and use a different method, like the median, to characterize avg delta.

df %>%
  group_by(doctor, patient_no) %>%
  mutate(datetime = lubridate::mdy_hm(datetime),
         # coalesce helps
         days_since_last = coalesce(c(datetime - lag(datetime))/
                                      lubridate::ddays(1), 0)) %>%
  summarize(count = n(),
            median_time_between = median(days_since_last))

## A tibble: 4 x 4
# Groups:   doctor [3]
#  doctor   patient_no count median_time_between
#  <chr>         <dbl> <int>               <dbl>
#1 dr.amoor       1596     1                  0 
#2 dr.gould        995     1                  0 
#3 dr.gould       1397     1                  0 
#4 dr.kahn        1561     2                133.

sample data

df <- tibble::tribble(
     ~doctor, ~patient_no,       ~datetime,
   "dr.kahn",        1561, "1/21/19 10:30",
  "dr.gould",        1397,  "2/6/19 12:30",
  "dr.amoor",        1596,  "2/11/19 9:00",
  "dr.gould",         995, "10/7/19 12:30",
   "dr.kahn",        1561, "10/14/19 9:30"
  )
Jon Spring
  • 55,165
  • 4
  • 35
  • 53