0

I'm trying to figure out how to combine rows that have a single column of dates, such that the new table/data frame/tibble will have two columns: one for the start date and one for the end date, but only for consecutive dates (i.e. any gaps in dates should be separated into a new row in the new table). It would also be grouped by different categorizations.

An example of the kind of data I'm manipulating is the following:

   Person ID   Department   Date     
   351581      JE           12/1/2019
   351581      JE           12/2/2019
   351581      FR           12/2/2019
   351581      JE           12/3/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019
   615418      AB           12/22/2019

And the desired result would be:

   Person ID   Department   Start Date      End Date
   351581      JE           12/1/2019       12/3/2019
   351581      FR           12/2/2019       12/2/2019
   598168      GH           12/16/2019      12/16/2019
   351581      JE           12/8/2019       12/9/2019
   615418      AB           12/20/2019      12/20/2019
   615418      AB           12/22/2019      12/22/2019

My searches so far have turned up a couple of possibly-related questions that involve combining date ranges, but I'm not sure how they'd be applied to just a single column of dates:

Find all date ranges for overlapping start and end dates in R

Date roll-up in R

dplyr

Adding this for the benefit of future folks, I ended up applying the accepted solution with dplyr, just because I'm more comfortable with the syntax.

df %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(`Person ID`, Department, Date) %>%
  group_by(`Person ID`, Department, 
           g = cumsum(c(0, diff(Date)) != 1)
           ) %>%
  summarize(Start = min(Date), End = max(Date)) %>%
  ungroup %>%
  select(-g)
ginopono
  • 23
  • 5
  • In your example data, dates are always consecutive inside a (Person_ID, Department) group. For your real data, is it sufficient to just produce start and end dates according to those groups, or do you actually need to check that the dates are consecutive and separate into more groups within those groups if not? – IceCreamToucan Jan 07 '20 at 14:45
  • @IceCreamToucan Would that just be a matter of sorting them first, perhaps just by adding in a step with dplyr's arrange()? – ginopono Jan 07 '20 at 15:08
  • No. This is all assuming the data is *already* sorted by date. A concrete example of when the two approaches differ is if there were two rows at the end with the same Person_ID and Department as the last row, and dates 12/23/2019 and 12/24/2019. Then the current answers' number of groups will not change, but the number of groups according to the logic in the question should increase by 1 (because in this hypothetical example the last two rows should be in a new group, not grouped with the 12/20/2019 row) – IceCreamToucan Jan 07 '20 at 15:10
  • I do need to check whether the dates are consecutive and split into a new group accordingly. Like you said, I think the question describes this. – ginopono Jan 07 '20 at 15:17
  • Ok thanks for clarifying. The reason I ask is that none of the current answers actually do this, because for the example you gave this is not necessary to get the desired output. If you add an additional row at the end with the same Person_ID and Department as the current last row, but with date 12/22/2019, then show the desired output having 6 rows, it would be more clear since for that modified example, simply grouping by the first two columns and ignoring dates will no longer coincidentally give the same output as the approach you're actually asking for. – IceCreamToucan Jan 07 '20 at 15:20
  • 1
    Thanks. I made that change to the example – ginopono Jan 07 '20 at 15:24

5 Answers5

4

We assume here that what is being asked is within each contiguous group of Person_ID and Departmwent we want the min and max date.

1) data.table First convert the Date columnn to Date class and then grouping by rleid(Person_ID) take the minimum and maximum values.

library(data.table)
library(lubridate)

DT <- as.data.table(DF0)
DT[, Date := mdy(Date)][
   , list(start = min(Date), end = max(Date)), 
   by = .(rleid(Person_ID, Department), Person_ID, Department)][-1]

giving:

   Person_ID Department      start        end
1:    351581         GH 2019-12-01 2019-12-03
2:    351581         FR 2019-12-02 2019-12-02
3:    598168         GH 2019-12-16 2019-12-16
4:    351581         JE 2019-12-08 2019-12-09
5:    615418         AB 2019-12-20 2019-12-20

2) Base R Convert the Date to Date class and then create a grouping variable g using rle. Then define a Range function which outputs the start and end given a group and apply it to each group.

DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
g <- with(rle(paste(DF$Person_ID, DF$Department)), rep(seq_along(lengths), lengths))
Range <- function(x) data.frame(x[1, 1:2], start = min(x$Date), end = max(x$Date))
do.call("rbind", by(DF, g, Range))

giving:

  Person_ID Department      start        end
1    351581         GH 2019-12-01 2019-12-03
2    351581         FR 2019-12-02 2019-12-02
3    598168         GH 2019-12-16 2019-12-16
4    351581         JE 2019-12-08 2019-12-09
5    615418         AB 2019-12-20 2019-12-20

3) dplyr/data.table A mixed approach where we use rleid from data.table and otherwise use dplyr is the following. Convert the Date using lubridate and the group by rleid and, Person_ID and Department. The last two are to ensure that they are included in the output. Calculate the start and end and then remove the grouping column.

library(dplyr)
library(data.table)
library(lubridate)

DF0 %>%
  mutate(Date = mdy(Date)) %>%
  group_by(g = rleid(Person_ID, Department), Person_ID, Department) %>%
  summarize(start = min(Date), end = max(Date)) %>%
  ungroup %>%
  select(-g)

giving:

# A tibble: 5 x 4
  Person_ID Department start      end       
      <int> <fct>      <date>     <date>    
1    351581 GH         2019-12-01 2019-12-03
2    351581 FR         2019-12-02 2019-12-02
3    598168 GH         2019-12-16 2019-12-16
4    351581 JE         2019-12-08 2019-12-09
5    615418 AB         2019-12-20 2019-12-20

4) sqldf Define the group Grp in the inner select and then find the min and max Date by Grp.

library(sqldf)

DF <- trnsform(DF0, Date = as.Date(Date, "%m/%d/%Y"))

sqldf("select Person_ID, Department, min(Date) as start__Date, max(Date) as end__Date
from ( select 
    rowid r, 
    Person_ID, 
    Department, 
    Date, 
    Date - dense_rank() over (partition by Person_ID, Department order by rowid) as Grp
  from DF
) group by Grp order by r", method = "name__class")

giving:

  Person_ID Department      start        end
1    351581         GH 2019-12-01 2019-12-03
2    351581         FR 2019-12-02 2019-12-02
3    598168         GH 2019-12-16 2019-12-16
4    351581         JE 2019-12-08 2019-12-09
5    615418         AB 2019-12-20 2019-12-20

Note

The input is assumed to be:

Lines <- "Person_ID   Department   Date     
   351581      GH           12/1/2019
   351581      GH           12/2/2019
   351581      GH           12/3/2019
   351581      FR           12/2/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019"

DF0 <- read.table(text = Lines, header = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
2

Here I am checking whether the difference with the previous date (diff(Date)) is not 1. If so, start a new group (taking the cumsum of this indicator means g will increase by 1 whenever it's TRUE).

library(data.table)
setDT(df)

df[, Date := as.Date(Date, format = '%m/%d/%Y')]


df[, .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]

#    Person_ID Department g      start        end
# 1:    351581         GH 1 2019-12-01 2019-12-03
# 2:    351581         FR 2 2019-12-02 2019-12-02
# 3:    598168         GH 3 2019-12-16 2019-12-16
# 4:    351581         JE 4 2019-12-08 2019-12-09
# 5:    615418         AB 5 2019-12-20 2019-12-20
# 6:    615418         AB 6 2019-12-22 2019-12-22

If your data is not already ordered by date within (Person_ID, Department) groups, you can add order(Date) to the i part of df[i, j, k] i.e. change the code above to

df[order(Date), .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]

Note that for this updated example, this is not the same as grouping by Person_ID and Department

df[, .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department)]

#    Person_ID Department      start        end
# 1:    351581         GH 2019-12-01 2019-12-03
# 2:    351581         FR 2019-12-02 2019-12-02
# 3:    598168         GH 2019-12-16 2019-12-16
# 4:    351581         JE 2019-12-08 2019-12-09
# 5:    615418         AB 2019-12-20 2019-12-22

Data used:

df <- fread('
   Person_ID   Department   Date     
   351581      GH           12/1/2019
   351581      GH           12/2/2019
   351581      GH           12/3/2019
   351581      FR           12/2/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019
  615418      AB           12/22/2019
')
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
  • Looks like this works! Does it rely on the data being sorted, or is that handled by the grouping? – ginopono Jan 07 '20 at 15:48
  • Yes, the original answer does rely on dates within each group to be sorted already. I've added an edit for what to do if the dates aren't already ordered. – IceCreamToucan Jan 07 '20 at 15:56
  • Looks like that broke it, but it works if I do ``` df[order(Person_ID, Department, Date), .(start = min(Date), end = max(Date)), by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))] ``` – ginopono Jan 07 '20 at 16:28
1

Provided you've already filtered out the data with gaps, this looks to me a pretty clean solution. Is it hat you're looking for?


require(dplyr)

df <- tibble::tribble(~`Person ID`, ~`Department`,    ~`Date`,
                      "351581"    ,          "GH", as.Date("12/1/2019", format = "%m/%d/%y"),
                      "351581"    ,          "GH", as.Date("12/2/2019", format = "%m/%d/%y"),
                      "351581"    ,          "GH", as.Date("12/3/2019", format = "%m/%d/%y"),
                      "351581"    ,          "FR", as.Date("12/2/2019", format = "%m/%d/%y"),
                      "598168"    ,          "GH", as.Date("12/16/2019", format = "%m/%d/%y"),
                      "351581"    ,          "JE", as.Date("12/8/2019", format = "%m/%d/%y"),
                      "351581"    ,          "JE", as.Date("12/9/2019", format = "%m/%d/%y"),
                      "615418"    ,          "AB", as.Date("12/20/2019", format = "%m/%d/%y"))

df %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(Date),
            `End Date` = max(Date)) %>% 
  ungroup()

#> # A tibble: 5 x 4
#>   `Person ID` Department `Start Date` `End Date`
#>   <chr>       <chr>      <date>       <date>    
#> 1 351581      FR         2020-12-02   2020-12-02
#> 2 351581      GH         2020-12-01   2020-12-03
#> 3 351581      JE         2020-12-08   2020-12-09
#> 4 598168      GH         2020-12-16   2020-12-16
#> 5 615418      AB         2020-12-20   2020-12-20

Edo
  • 7,567
  • 2
  • 9
  • 19
  • It looks to me like this would include gaps within the start and end dates, yes? I've edited the example to contain a case of a gap. – ginopono Jan 07 '20 at 15:11
  • indeed. Look at my premises. I specifically said "Provided you've already filtered out the data with gaps". Based on your previous example, I thought you knew how to handle it – Edo Jan 07 '20 at 17:43
1

Use dplyr

Assuming you have data on a data.frame, you can achieve your result grouping by Pearson_id and Department:

library(dplyr)
data %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(as.Date(Date, format = "%m/%d/%Y")), 
            `End Date` = max(as.Date(Date, format = "%m/%d/%Y")))

Output will be:

# A tibble: 5 x 4
# Groups:   Person_id [3]
  Person ID Department `Start Date` `End Date`
      <int> <fct>      <date>       <date>    
1    351581 FR         2019-12-02   2019-12-02
2    351581 GH         2019-12-01   2019-12-03
3    351581 JE         2019-12-08   2019-12-09
4    598168 GH         2019-12-16   2019-12-16
5    615418 AB         2019-12-20   2019-12-20

Hope this help.

Louis
  • 3,592
  • 2
  • 10
  • 18
0

Here is a base R solution

dfout <- do.call(rbind,
                 c(lapply(split(df,cut(1:nrow(df),c(0,cumsum(rle(df$Department)$lengths)))), 
                          function(x) data.frame(unique(x[-3]),
                                                 `Start Date` = head(x[,3],1),
                                                 `End Date` = tail(x[,3],1))),
                   make.row.names = F)
                 )

such that

> dfout
  Person.ID Department Start.Date   End.Date
1    351581         GH  12/1/2019  12/3/2019
2    351581         FR  12/2/2019  12/2/2019
3    598168         GH 12/16/2019 12/16/2019
4    351581         JE  12/8/2019  12/9/2019
5    615418         AB 12/20/2019 12/20/2019
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81