
Suppose that I have two tables (DT_sportA and DT_sportB) that record the time periods during which two children (id) played sports "A" and "B".

library(data.table)
library(lubridate)

DT_sportA <- data.table(id = rep(1:2,each=2),
                start_date = ymd(c("2000-01-01","2002-01-15","2014-03-12","2016-10-14")),
                end_date = ymd(c("2000-02-03","2003-03-01","2014-04-03","2017-05-19")))
DT_sportA
#    id start_date   end_date
# 1:  1 2000-01-01 2000-02-03
# 2:  1 2002-01-15 2003-03-01
# 3:  2 2014-03-12 2014-04-03
# 4:  2 2016-10-14 2017-05-19


DT_sportB <- data.table(id = c(1L,1L,2L),
                        start_date = ymd(c("2000-01-15","2002-01-15","2017-02-10")),
                        end_date = ymd(c("2000-02-01","2006-03-19","2017-02-20")))

DT_sportB
#    id start_date   end_date
# 1:  1 2000-01-15 2000-02-01
# 2:  1 2002-01-15 2006-03-19
# 3:  2 2017-02-10 2017-02-20

I would like to generate a new table containing all of the unique and overlapping date ranges, with two categorical indicator columns denoting the sport(s) played by the children. The desired DT should look like this:

   id start_date   end_date sportA sportB
1:  1 2000-01-01 2000-01-14      1      0
2:  1 2000-01-15 2000-02-01      1      1
3:  1 2000-02-02 2000-02-03      1      0
4:  1 2002-01-15 2002-03-01      1      1
5:  1 2002-03-02 2002-03-19      0      1
6:  2 2014-03-12 2014-04-03      1      0
7:  2 2016-10-14 2017-02-09      1      0
8:  2 2017-02-10 2017-02-20      1      1
9:  2 2017-02-21 2017-05-19      1      0

This is a fairly trivial toy example. The real data spans several million rows and approximately 20 "sports", which is why I am looking for a data.table solution.

r2evans
user3102806

1 Answer


Notes:

  • when doing similar/same things to multiple tables, I find it almost always preferable to operate on them as a list of tables instead of as individual objects; this solution can be adapted to work without that step, but I believe the mind-shift pays for itself in many ways;

  • further, I actually think a long-format is better than a list-of-tables here, as we can still differentiate id and sport with ease;

  • your expected output is a little inconsistent in how it avoids overlap between rows: for example, "2000-01-14" is not in the data, yet it appears as an end_date, suggesting that "2000-01-15" was reduced because the next row starts on that date ... but there is also a start on "2000-02-02", apparently for similar (but reversed) reasons; one way around this is to subtract a very small number from end_date so that no id/sport/date range will match multiple rows. I say "small number" and not 1 because Date-class objects are really numeric, and dates can be fractional: a fractional Date is not displayed fractionally, but the fraction is still there; compare Sys.Date() - 0.1 with dput(Sys.Date() - 0.1).
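
A quick way to see the fractional-Date behavior just mentioned (using a fixed date here rather than Sys.Date() so the value is reproducible):

```r
d <- as.Date("2000-01-15") - 0.1
d        # still prints as a whole date, hiding the fraction
dput(d)  # reveals the underlying numeric, e.g. structure(10970.9, class = "Date")
```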

sports <- rbindlist(mget(ls(pattern = "DT_sport.*")), idcol = "sport")
sports[, sport := gsub("^DT_", "", sport) ] # primarily aesthetics
#     sport    id start_date   end_date
#    <char> <int>     <Date>     <Date>
# 1: sportA     1 2000-01-01 2000-02-03
# 2: sportA     1 2002-01-15 2003-03-01
# 3: sportA     2 2014-03-12 2014-04-03
# 4: sportA     2 2016-10-14 2017-05-19
# 5: sportB     1 2000-01-15 2000-02-01
# 6: sportB     1 2002-01-15 2006-03-19
# 7: sportB     2 2017-02-10 2017-02-20

I tend to like piping data.table, and since I'm still on R-4.0.5, I use magrittr::%>% for this; it is not strictly required, but I feel it helps readability (and therefore maintainability, etc.). (I don't know if this will work as easily with R-4.1's native |> pipe, since that places more restrictions on where the RHS data may appear.)

library(magrittr)
out <- sports[, {
  vec <- sort(unique(c(start_date, end_date)));
  .(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
  .[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
  sports[., on = .(id, start_date <= sd, end_date >= ed) ] %>%
  .[ !is.na(sport), ] %>%
  .[, val := 1L ] %>%
  dcast(id + start_date + end_date ~ sport, value.var = "val", fill = 0)
out
#       id start_date   end_date sportA sportB
#    <int>     <Date>     <Date>  <int>  <int>
# 1:     1 2000-01-01 2000-01-14      1      0
# 2:     1 2000-01-15 2000-01-31      1      1
# 3:     1 2000-02-01 2000-02-02      1      0
# 4:     1 2002-01-15 2003-02-28      1      1
# 5:     1 2003-03-01 2006-03-19      0      1
# 6:     2 2014-03-12 2014-04-02      1      0
# 7:     2 2016-10-14 2017-02-09      1      0
# 8:     2 2017-02-10 2017-02-19      1      1
# 9:     2 2017-02-20 2017-05-19      1      0

Walk-through:

  • the first sports[, {...}] produces all of the feasible date ranges, per id; it produces more than are needed, and the extras are filtered out a little later; I combine this with a slight offset to end_date so that rows are mutually exclusive (third note above); while the rows appear to be separated by full days, they are actually separated by under one second, which I demonstrate by adding a secdiff column here:

    sports[, {
      vec <- sort(unique(c(start_date, end_date)));
      .(sd = vec[-length(vec)], ed = vec[-1]);
    }, by = .(id) ] %>%
      .[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
      .[, secdiff := c(as.numeric(sd[-1] - ed[-.N], units="secs"), NA), by = .(id) ]
    #        id         sd         ed   secdiff
    #     <int>     <Date>     <Date>     <num>
    #  1:     1 2000-01-01 2000-01-14 0.8640000
    #  2:     1 2000-01-15 2000-01-31 0.8640000
    #  3:     1 2000-02-01 2000-02-02 0.8640000
    #  4:     1 2000-02-03 2002-01-14 0.8640000  # will be empty post-join
    #  5:     1 2002-01-15 2003-02-28 0.8640000
    #  6:     1 2003-03-01 2006-03-19        NA
    #  7:     2 2014-03-12 2014-04-02 0.8640001
    #  8:     2 2014-04-03 2016-10-13 0.8640001  # will be empty post-join
    #  9:     2 2016-10-14 2017-02-09 0.8640001
    # 10:     2 2017-02-10 2017-02-19 0.8640001
    # 11:     2 2017-02-20 2017-05-19        NA
    
  • btw, the first operation on sports[..] in the previous bullet is wrapped in a {}-block for a slight efficiency boost: sort(unique(c(start_date, end_date))) is computed once instead of twice;

  • left join sports onto this, on id and the date-ranges; this will produce NA values in the sport column, which indicates the date ranges that were programmatically made (with a simple sequence of dates) but no sports are assigned; these not-needed rows are removed by the !is.na(sport);

  • assigning val := 1L is purely so that we have a value column during reshaping;

  • dcast reshapes and fills the missing values with 0.
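
To isolate that last reshape step, here is a minimal sketch with toy data (hypothetical values, just to show what value.var and fill = 0 are doing):

```r
library(data.table)

# toy long-format data: id 2 never played sportB
toy <- data.table(id    = c(1L, 1L, 2L),
                  sport = c("sportA", "sportB", "sportA"),
                  val   = 1L)

# one column per sport; combinations absent from 'toy' get 0 instead of NA
dcast(toy, id ~ sport, value.var = "val", fill = 0)
#    id sportA sportB
# 1:  1      1      1
# 2:  2      1      0
```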

r2evans
  • Many thanks for this excellent solution and walk-through, although I must admit that it is slightly above my pay grade. My question is whether it would be possible to "correct the date display" (or add back what was removed earlier) as a final step? The idea is to later predict, let's say, "injury rates" during periods in which the individuals played different sports. This will involve additional non-equi joins and it's therefore important for the overlapping dates (on the day-level) to be correct in that step. This might perhaps not be an issue? – user3102806 Oct 25 '21 at 14:22
  • Your sample data is inconsistent, so I'm not sure what you really mean. If you mean you want the `end_date` to be restored to the original values (and therefore have truly-overlapping `end`/`start` between rows), then ... sure, just test if it is within `1.1e-5` of the next `start_date` and add `1e-5` back to it if true. – r2evans Oct 25 '21 at 14:25
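
A minimal sketch of the correction described in that last comment, on a hypothetical two-row table mimicking the structure of the result (here the first row's end_date has been reduced by 1e-5 days, as in the answer):

```r
library(data.table)

# hypothetical demo table: row 1's end_date was shifted down by 1e-5 days
demo <- data.table(id         = 1L,
                   start_date = as.Date(c("2000-01-01", "2000-01-15")),
                   end_date   = as.Date(c("2000-01-15", "2000-02-01")) - c(1e-5, 0))

# gap (in days) between each end_date and the next row's start_date
demo[, gap := as.numeric(shift(start_date, type = "lead") - end_date), by = id]
# where the gap is within tolerance of the offset, add the 1e-5 back,
# restoring a whole-day end_date (and truly-overlapping end/start between rows)
demo[!is.na(gap) & gap < 1.1e-5, end_date := end_date + 1e-5]
demo[, gap := NULL]
```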