3

I have a set of observations, which are recorded every time a user has taken an action. I want to filter only those observations from a user which are six or more months apart.

So, if a user has taken this action on "2018-01-01", "2018-03-01" and "2018-07-01", I only want to keep only "2018-01-01" and "2018-07-01".

Similarly, if a user has taken an action on "2018-01-01", "2018-03-01", "2018-07-01" and "2019-03-01" I want to keep only "2018-01-01", "2018-07-01", "2019-03-01".

So far, I have produced long and unworkable code.

# What I want to achieve
library(data.table)

dataIhave <- data.table(id    = c(1, 1, 1, 1, 2, 2, 3, 4), 
                        dates = c("2018-01-01", 
                                  "2018-03-01",
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-03", 
                                  "2018-07-02", 
                                  "2018-02-01",
                                  "2018-02-01"))

dataIwant <- data.table(id    = c(1, 1, 1, 2, 3, 4), 
                        dates = c("2018-01-01", 
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-01", 
                                  "2018-02-01",
                                  "2018-02-01"))

5 Answers5

6

This is a rolling-join variant of @Uwe's answer:

library(lubridate)
dataIhave[, dates := as.IDate(dates)]

ids = unique(dataIhave$id)

dataIhave[, seq := NA_integer_]
s = 1L
w = dataIhave[.(ids), on=.(id), mult="first", which = TRUE]
dataIhave[w, seq := s]
while (TRUE){
  w = dataIhave[
    dataIhave[w, .(id, dates = dates %m+% months(6))], 
    on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE
  ]

  if (!length(w)) break
  s = s + 1L
  dataIhave[w, seq := s]
}

dataIhave[!is.na(seq)]

   id      dates seq
1:  1 2018-01-01   1
2:  1 2018-07-01   2
3:  1 2019-01-01   3
4:  2 2018-01-03   1
5:  3 2018-02-01   1
6:  4 2018-02-01   1

The loop takes rows w defined per id, steps their dates forward six months, and takes the next row found, if any. The arguments to the join are:

  • The tables, with join syntax x[i, ...]
    • x = dataIhave
    • i = dataIhave[w, .(id, dates = dates %m+% months(6))]
  • on = .(id, date): columns to match by
  • roll = -Inf: find the next match on the last column in on=
  • nomatch = 0: if no match is found, skip
  • which = TRUE: return matched row number

Additionally, if there are duplicate dates (see the second example in @Uwe's post):

  • mult = "first": take only the first match for each row of i

In selecting the first row by id before the loop, I'm assuming the data is sorted by dates within id (so I'm not using order as @Uwe's answer does).

Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Thank you for the recursive rolling join version. However, in order to deal with duplicate dates as in my 2nd test case, the rolling join requires the `mult="first"` parameter as well. – Uwe Jun 23 '19 at 06:50
3

If I understand correctly, the OP wants to drop those dates which are less than six months apart from the beginning of a period and start a new period at the first date that is more than 6 months apart from the begin of the previous period (separately for each id).

I have no idea how this can be accomplished by a non-recursive rolling or non-equi join as there is no fixed grid of dates. So, I believe it requires a kind of recursive approach, somehow. Here is one possibility:

library(data.table)
library(lubridate)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := TRUE]
dataIhave[order(id, dates)
  , keep := {
    start <- dates[1L]
    for (i in tail(seq_along(dates), -1L)) {
      if (dates[i] < start %m+% months(6)) {
        keep[i] <- FALSE
      } else {
        start <- dates[i]
      }
    }
    keep
  }, by = id][]
   id      dates  keep
1:  1 2018-01-01  TRUE
2:  1 2018-03-01 FALSE
3:  1 2018-07-01  TRUE
4:  1 2019-01-01  TRUE
5:  2 2018-01-03  TRUE
6:  2 2018-07-02 FALSE
7:  3 2018-02-01  TRUE
8:  4 2018-02-01  TRUE

Finally,

dataIhave[(keep), -"keep"]
   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

2nd test case

The crucial point here is to detect the beginning of a new period (within each id).

As an additional test case I have added two dates to id == 1, 2018-07-01 and 2018-07-02.
2018-07-01 is a duplicate. Both dates should be removed as they lie within the second 6-months period starting at 2018-07-01.

dataIhave <- fread("
 id      dates
  1 2018-01-01
  1 2018-03-01
  1 2018-07-01
  1 2018-07-01
  1 2018-07-02
  1 2019-01-01
  2 2018-01-03
  2 2018-07-02
  3 2018-02-01
  4 2018-02-01")

Indeed, the code above returns the same output as with OP's original test case.

Remove rows only within the first six months for each id

If the question is interpreted to only remove entries within the first 6-months period for each id and keep all the date after 6 months this can be achieved by

dataIhave[!dataIhave[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1]

which returns

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2018-07-01
4:  1 2018-07-02
5:  1 2019-01-01
6:  2 2018-01-03
7:  3 2018-02-01
8:  4 2018-02-01

for the second test case. (Note that this is a streamlined version of Jaap's answer.)

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
2

Another variant:

library(lubridate)
library(data.table)

dataIhave[, dates := as.Date(dates)]

dataIhave[, keep := dates >= dates[1] %m+% months(6), by = id
          ][dataIhave[, .I[1], by = id][[2]], keep := TRUE
            ][!!keep, -"keep"]

which gives:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Your answer removes only dates within the _first_ 6 months for each `id`. So, your code gives a different result for my 2nd test case. (I don't say this is wrong as it depends on the interpretation of the Q.) – Uwe Jun 23 '19 at 06:21
  • @Uwe I know, as you already noted: it's just another (simpler) interpretation of the question – Jaap Jun 23 '19 at 08:40
  • 1
    Congrats with a `data.table` badge :-) – pogibas Jun 27 '19 at 10:31
1

Using non-equi join and igraph to avoid implicit loops and recursion:

#data prep
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
setorder(dataIhave[, rn:=rowid(id)], id, dates)
dataIhave[, end := as.IDate(sapply(dates, 
    function(d) seq(d, by="6 months", length.out=2L)[2L]))]

#non-equi self join to find first date that is after 6months
nonequi <- dataIhave[dataIhave, on=.(id, dates>=end), mult="first", by=.EACHI,
    .(i.id, i.rn, x.rn, i.dates, x.dates)]

library(igraph)
nonequi[, {
        #create graph from the previous join
        g <- graph_from_data_frame(.SD[, .(i.rn, x.rn)])
        #plot(g)

        #find the leaf nodes
        leaf <- sapply(V(g), function(x) length(neighbors(g,x))==0L)

        #from the first date (i.e. node = V(g)["1"]), find the path starting from this date.
        path <- get.all.shortest.paths(g, V(g)["1"], leaf)$res

        #return all dates (i.e. nodes) in this path
        .(dates=i.dates[i.rn %in% na.omit(V(g)[path[[1L]]]$name)])
    },
    by=.(id=i.id)]

output:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01

Or a recursive approach similar to Uwe's solution:

dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
unique(dataIhave[,
    .(dates=as.IDate(Reduce(
        function(x, y) if (y >= seq(x, by="6 months", length.out=2L)[2L]) y else x,
        dates,
        accumulate=TRUE))),
    .(id)])

output:

   id      dates
1:  1 2018-01-01
2:  1 2018-07-01
3:  1 2019-01-01
4:  2 2018-01-03
5:  3 2018-02-01
6:  4 2018-02-01
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0
library(lubridate)
library(data.table)

dataiHave[, dates := ymd(dates)]
dataiHave[, difDates := as.numeric(difftime(dates, units = "weeks"))]

dataIHave[difDates >= 24, .(id, dates)]

Does this produce the result you want?

Months have irregular durations so you'll have to stick to a time unit of fixed duration.

You can also check ?lubridate::interval, lubridate::as. duration and this question: Time difference in years with lubridate?

PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • 1
    I guess you can compare the date vs the date plus six months: https://stackoverflow.com/a/22646010 However, OP's rules might be underspecified. With events at months 1, 4, 8, 12, a valid "at least six months apart" selection could be (1, 8), (1, 12), (4, 12) and it's not clear which should be preferred. – Frank Jun 21 '19 at 23:55
  • 2
    @Frank I believe the OP wants to drop those dates which are less than six months apart from the beginning of a period and start a new period at the first date that is more than 6 months apart from the begin of the previous period. – Uwe Jun 22 '19 at 00:10