0

I'm trying to calculate how long one person stays in a homeless shelter using R. The homeless shelter has two different types of check-ins, one for overnight and another for a long-term. I would like to shape the data to get an EntryDate and ExitDate for every stay which does not have at least a one day break.

Here are what the data currently look like:

PersonalID        EntryDate        ExitDate
    1             2016-12-01       2016-12-02
    1             2016-12-03       2016-12-04
    1             2016-12-16       2016-12-17
    1             2016-12-17       2016-12-18
    1             2016-12-18       2016-12-19
    2             2016-10-01       2016-10-20
    2             2016-10-21       2016-10-22
    3             2016-09-01       2016-09-02
    3             2016-09-20       2016-09-21

Ultimately, I'm trying to get the above date to represent continuous ranges to calculate total length of stay by participant.

For example, the above data would become:

PersonalID      EntryDate   ExitDate
1               2016-12-01  2016-12-04
1               2016-12-16  2016-12-19
2               2016-10-01  2016-10-22
3               2016-09-01  2016-09-02
3               2016-09-20  2016-09-21
C. Thomas Brittain
  • 376
  • 1
  • 5
  • 12

2 Answers2

0

Here is an ugly solution. It is probably possible to do something more clean... But it works. This solution should alaso be debugged with real data (I have added one line to your exaple to have more different situations)

d <- read.table(text = '
PersonalID        EntryDate        ExitDate
    1             2016-12-01       2016-12-02
    1             2016-12-03       2016-12-04
    1             2016-12-16       2016-12-17
    1             2016-12-17       2016-12-18
    1             2016-12-18       2016-12-19
    2             2016-10-01       2016-10-20
    2             2016-10-21       2016-10-22
    3             2016-09-01       2016-09-02
    3             2016-09-20       2016-09-21
    4             2016-09-20       2016-09-21
', header = TRUE)

#' transorm in Date format
d$EntryDate <- as.Date(as.character(d$EntryDate))
d$ExitDate <- as.Date(as.character(d$ExitDate))
summary(d)

#' Reorder to be sure that the ExitDate / Entry date are in chronological order
d <- d[order(d$PersonalID, d$EntryDate),]
#' Add a column that will store the number of days between one exit and the next entry
d$nbdays <- 9999

# Split to have a list with dataframe for each ID
d <- split(d, d$PersonalID)
d

for(i in 1:length(d)) {

    # Compute number of days between one exit and the next entry (only if there are 
    # more than one entry)
    if(nrow(d[[i]])>1) {
        d[[i]][-1,"nbdays"] <- d[[i]][2:nrow(d[[i]]),"EntryDate"] - 
            d[[i]][1:(nrow(d[[i]])-1),"ExitDate"]
    }

    x <- d[[i]] # store a copy of the data to lighten the syntax

    # Entry dates for which the previous exit is higher than 1 day (including the first one)
    entr <- x[x$nbdays>1,"EntryDate"]

    # Exit dates just before cases where nbdays are > 1 and includes the last exit date. 
    # We use unique to avoid picking 2 times the last exit
    whichexist <- unique(c(c(which(x$nbdays > 1)-1)[-1],nrow(x)))
    exit <- x[whichexist,"ExitDate"]

    d[[i]] <- data.frame(
        PersonalID = x[1,1], 
        EntryDate = entr,
        ExitDate = exit
    )
}

# paste the elements of this list into one data.frame
do.call(rbind, d)
Gilles San Martin
  • 4,224
  • 1
  • 18
  • 31
  • This seems to work and is really fast. Still trying to understand it as it isn't the most readable of solutions, but I'll get there. :) Thank you, my friend. – C. Thomas Brittain Apr 23 '17 at 23:08
0

Here a solution using dplyr.

library(dplyr)

d = structure(list(PersonalID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
3L), EntryDate = structure(c(17136, 17138, 17151, 17152, 17153, 
17075, 17095, 17045, 17064), class = "Date"), ExitDate = structure(c(17137, 
17139, 17152, 17153, 17154, 17094, 17096, 17046, 17065), class = "Date")), class = "data.frame", .Names = c("PersonalID", 
"EntryDate", "ExitDate"), row.names = c(NA, -9L))

First create a temporary dataframe to hold all the dates between entry and exit date:

d2 = d %>%
    rowwise() %>%
    do(data.frame(PersonalID = .$PersonalID, Present = seq(.$EntryDate, .$ExitDate, by = 'day'))) %>%
    unique %>% ## remove double dates when exit and re-entry occur on the same day
    ungroup()

Then look for all the consecutive dates with some inpiration from https://stackoverflow.com/a/14868742/827766

d2 %>%
    group_by(PersonalID) %>%
    mutate(delta = c(1, diff(as.Date(Present)))) %>%
    group_by(PersonalID, stay = cumsum(delta!=1)) %>%
    summarize(EntryDate = min(Present), ExitDate = max(Present)) %>%
    subset(select = -c(stay)) 
Community
  • 1
  • 1
Johan
  • 810
  • 6
  • 12
  • I couldn't get this to provide the result I'm looking. Also, I failed to mention the dataset may be enormous--when I applied the above solution it took twenty-five minutes for the first dplyr pass. – C. Thomas Brittain Apr 23 '17 at 23:09
  • That's right this solution doesn't scale well to large datasets as it's very memory inefficient (only tested with the given example) – Johan Apr 24 '17 at 11:44