Filtering observations based on specific date condition using data.table

Question

I have a set of observations, which are recorded every time a user has taken an action. I want to filter only those observations from a user which are six or more months apart.

So, if a user has taken this action on "2018-01-01", "2018-03-01" and "2018-07-01", I only want to keep only "2018-01-01" and "2018-07-01".

Similarly, if a user has taken an action on "2018-01-01", "2018-03-01", "2018-07-01" and "2019-03-01" I want to keep only "2018-01-01", "2018-07-01", "2019-03-01".

So far, I have produced long and unworkable code.

# What I want to achieve
library(data.table)

dataIhave <- data.table(id    = c(1, 1, 1, 1, 2, 2, 3, 4), 
                        dates = c("2018-01-01", 
                                  "2018-03-01",
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-03", 
                                  "2018-07-02", 
                                  "2018-02-01",
                                  "2018-02-01"))

dataIwant <- data.table(id    = c(1, 1, 1, 2, 3, 4), 
                        dates = c("2018-01-01", 
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-01", 
                                  "2018-02-01",
                                  "2018-02-01"))

for id=2, do you mean 2018-01-03 instead of 2018-01-01? – chinsoon12 Jun 24 '19 at 01:22 — chinsoon12, Jun 24 '19 at 01:22

Frank · Answer 1 · 2019-06-23T13:30:22.607

This is a rolling-join variant of @Uwe's answer:

library(lubridate)
dataIhave[, dates := as.IDate(dates)]

ids = unique(dataIhave$id)

dataIhave[, seq := NA_integer_]
s = 1L
w = dataIhave[.(ids), on=.(id), mult="first", which = TRUE]
dataIhave[w, seq := s]
while (TRUE){
  w = dataIhave[
    dataIhave[w, .(id, dates = dates %m+% months(6))], 
    on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE
  ]

  if (!length(w)) break
  s = s + 1L
  dataIhave[w, seq := s]
}

dataIhave[!is.na(seq)]

   id      dates seq
1:  1 2018-01-01   1
2:  1 2018-07-01   2
3:  1 2019-01-01   3
4:  2 2018-01-03   1
5:  3 2018-02-01   1
6:  4 2018-02-01   1

The loop takes rows w defined per id, steps their dates forward six months, and takes the next row found, if any. The arguments to the join are:

The tables, with join syntax x[i, ...]
- x = dataIhave
- i = dataIhave[w, .(id, dates = dates %m+% months(6))]
on = .(id, date): columns to match by
roll = -Inf: find the next match on the last column in on=
nomatch = 0: if no match is found, skip
which = TRUE: return matched row number

Additionally, if there are duplicate dates (see the second example in @Uwe's post):

mult = "first": take only the first match for each row of i

In selecting the first row by id before the loop, I'm assuming the data is sorted by dates within id (so I'm not using order as @Uwe's answer does).

Thank you for the recursive rolling join version. However, in order to deal with duplicate dates as in my 2nd test case, the rolling join requires the `mult="first"` parameter as well. — Uwe, Jun 23 '19 at 06:50

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

If I understand correctly, the OP wants to drop those dates which are less than six months apart from the beginning of a period and start a new period at the first date that is more than 6 months apart from the begin of the previous period (separately for each id).

I have no idea how this can be accomplished by a non-recursive rolling or non-equi join as there is no fixed grid of dates. So, I believe it requires a kind of recursive approach, somehow. Here is one possibility:

library(data.table)
library(lubridate)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := TRUE]
dataIhave[order(id, dates)
  , keep := {
    start <- dates[1L]
    for (i in tail(seq_along(dates), -1L)) {
      if (dates[i] < start %m+% months(6)) {
        keep[i] <- FALSE
      } else {
        start <- dates[i]
      }
    }
    keep
  }, by = id][]

   id      dates  keep
1:  1 2018-01-01  TRUE
2:  1 2018-03-01 FALSE
3:  1 2018-07-01  TRUE
4:  1 2019-01-01  TRUE
5:  2 2018-01-03  TRUE
6:  2 2018-07-02 FALSE
7:  3 2018-02-01  TRUE
8:  4 2018-02-01  TRUE

Finally,

dataIhave[(keep), -"keep"]

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

2nd test case

The crucial point here is to detect the beginning of a new period (within each id).

As an additional test case I have added two dates to id == 1, 2018-07-01 and 2018-07-02.
2018-07-01 is a duplicate. Both dates should be removed as they lie within the second 6-months period starting at 2018-07-01.

dataIhave <- fread("
 id      dates
  1 2018-01-01
  1 2018-03-01
  1 2018-07-01
  1 2018-07-01
  1 2018-07-02
  1 2019-01-01
  2 2018-01-03
  2 2018-07-02
  3 2018-02-01
  4 2018-02-01")

Indeed, the code above returns the same output as with OP's original test case.

Remove rows only within the first six months for each `id`

If the question is interpreted to only remove entries within the first 6-months period for each id and keep all the date after 6 months this can be achieved by

dataIhave[!dataIhave[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1]

which returns

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2018-07-01
4:  1 2018-07-02
5:  1 2019-01-01
6:  2 2018-01-03
7:  3 2018-02-01
8:  4 2018-02-01

for the second test case. (Note that this is a streamlined version of Jaap's answer.)

score 2 · Answer 3 · answered Jun 22 '19 at 07:02

2

Another variant:

library(lubridate)
library(data.table)

dataIhave[, dates := as.Date(dates)]

dataIhave[, keep := dates >= dates[1] %m+% months(6), by = id
          ][dataIhave[, .I[1], by = id][[2]], keep := TRUE
            ][!!keep, -"keep"]

which gives:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

answered Jun 22 '19 at 07:02

Jaap

81,064
34
182
193

Your answer removes only dates within the _first_ 6 months for each `id`. So, your code gives a different result for my 2nd test case. (I don't say this is wrong as it depends on the interpretation of the Q.) – Uwe Jun 23 '19 at 06:21
@Uwe I know, as you already noted: it's just another (simpler) interpretation of the question – Jaap Jun 23 '19 at 08:40
1

Congrats with a `data.table` badge :-) – pogibas Jun 27 '19 at 10:31

chinsoon12 · Answer 4 · 2019-06-27T01:47:31.097

Using non-equi join and igraph to avoid implicit loops and recursion:

#data prep
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
setorder(dataIhave[, rn:=rowid(id)], id, dates)
dataIhave[, end := as.IDate(sapply(dates, 
    function(d) seq(d, by="6 months", length.out=2L)[2L]))]

#non-equi self join to find first date that is after 6months
nonequi <- dataIhave[dataIhave, on=.(id, dates>=end), mult="first", by=.EACHI,
    .(i.id, i.rn, x.rn, i.dates, x.dates)]

library(igraph)
nonequi[, {
        #create graph from the previous join
        g <- graph_from_data_frame(.SD[, .(i.rn, x.rn)])
        #plot(g)

        #find the leaf nodes
        leaf <- sapply(V(g), function(x) length(neighbors(g,x))==0L)

        #from the first date (i.e. node = V(g)["1"]), find the path starting from this date.
        path <- get.all.shortest.paths(g, V(g)["1"], leaf)$res

        #return all dates (i.e. nodes) in this path
        .(dates=i.dates[i.rn %in% na.omit(V(g)[path[[1L]]]$name)])
    },
    by=.(id=i.id)]

output:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

Or a recursive approach similar to Uwe's solution:

dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
unique(dataIhave[,
    .(dates=as.IDate(Reduce(
        function(x, y) if (y >= seq(x, by="6 months", length.out=2L)[2L]) y else x,
        dates,
        accumulate=TRUE))),
    .(id)])

output:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

PavoDive · Answer 5 · 2019-06-21T23:14:23.157

0

library(lubridate)
library(data.table)

dataiHave[, dates := ymd(dates)]
dataiHave[, difDates := as.numeric(difftime(dates, units = "weeks"))]

dataIHave[difDates >= 24, .(id, dates)]

Does this produce the result you want?

Months have irregular durations so you'll have to stick to a time unit of fixed duration.

You can also check ?lubridate::interval, lubridate::as. duration and this question: Time difference in years with lubridate?

edited Jun 21 '19 at 23:14

answered Jun 21 '19 at 22:59

PavoDive

6,322
2
29
55

1

I guess you can compare the date vs the date plus six months: https://stackoverflow.com/a/22646010 However, OP's rules might be underspecified. With events at months 1, 4, 8, 12, a valid "at least six months apart" selection could be (1, 8), (1, 12), (4, 12) and it's not clear which should be preferred. – Frank Jun 21 '19 at 23:55
2

@Frank I believe the OP wants to drop those dates which are less than six months apart from the beginning of a period and start a new period at the first date that is more than 6 months apart from the begin of the previous period. – Uwe Jun 22 '19 at 00:10

Filtering observations based on specific date condition using data.table

5 Answers5

2nd test case

Remove rows only within the first six months for each id

Remove rows only within the first six months for each `id`