Conditionally remove duplicate elements from a data.table

Question

I have merged two datasets. Set X contains an ID and a startDate, set Y an ID and an endDate. Both have the ID as key. I want to match every startDate with its corresponding endDate (provided it is bigger than the startDate). The problem is that IDs may appear multiple times in a given set, but not always as many times in the same set. Hence, a startDate can get matched up with multiple entries of endDate, and vice versa. This is basically the code I have:

require(data.table)

startDate = c(1,2,3,1)
IDX = c(1,2,2,3)
endDate = c(2,3,4)
IDY = c(1,1,2)

X = data.table(startDate,IDX)
Y= data.table(endDate,IDY)

setkey(X,IDX)
setkey(Y,IDY)
DT = X[Y,allow.cartesian = TRUE][endDate>startDate]

Now I would like to conditionally remove duplicated entries from the set DT. Corresponding to the keys "ID" and "startDate" I want to only keep the duplicated entry with the lowest endDate. For every duplicated entry of "ID" and "endDate" I want to do the same, but instead keep the entry that has the highest startDate.

DT looks like this:

   IDX startDate endDate
1:   1         1       2
2:   1         1       3
3:   2         2       4
4:   2         3       4

There are two duplicates in this new data table. 1 and 2 are duplicates of eachother, with a different endDate. Only entry 1 has to stay (since it has the closest endDate to the startDate). 3 and 4 are also duplicates of eachother, with a different startDate. Here, entry 4 has to stay (since it has the closes startDate to the endDate). So the required output is

   IDX startDate endDate
1:   1         1       2
2:   2         3       4

I don't know how to achieve this with duplicated(DT), since it does not necessarily give the specific duplicate that I want. Anyone got a suggestion? Moreover, can I perhaps already solve this during the "join" instead of solving it afterwards?

I am not sure if I understood your question right: What would be the correct outcome of `data.table( IDX = c(1,1,1), startDate = c(2,2,1), endDate = c(1,2,2) )`? — Beasterfield, Apr 01 '14 at 15:41
@Arun this is the simplest example that covers all the cases I want. I updated the output. — Forzaa, Apr 02 '14 at 11:39

eddi · Accepted Answer · 2014-04-02T19:05:23.507

2

I don't think this can be done in a single join, but perhaps the following can work:

Y[, startDate := endDate]
setkey(X, IDX, startDate)
setkey(Y, IDY, startDate)

Y[X, roll = -Inf][, list(startDate = startDate[.N]), by = list(IDY, endDate)]
#   IDY endDate startDate
#1:   1       2         1
#2:   2       4         3

edited Apr 02 '14 at 19:05

answered Apr 01 '14 at 16:28

eddi

49,088
6
104
155

I show the expected output now so that the question is more clear (I hope). In practice X and Y don't have equal dimensions, btw. – Forzaa Apr 02 '14 at 11:51
@Forzaa ok, see if the edited version is what you're looking for – eddi Apr 02 '14 at 19:05
2

+1 Awesome use of `roll`. You may want to add `nomatch=0L` to ensure that there are no `NA` results when there are no `endDate > startDate` for a particular ID. – Arun Apr 02 '14 at 22:14
@Arun thanks, that's a good point - OP can decide what they prefer to do in that situation – eddi Apr 02 '14 at 22:29
@eddi, yes that point was more or less for the OP :). Great answer, once again! I think it's also equivalent to `Y[X[J(unique(IDX)), mult="last"], roll=-Inf]` (or from 1.9.3+ also `Y[unique(X, by="IDX", fromLast=TRUE), roll=-Inf]` – Arun Apr 02 '14 at 22:37
@eddi Thanks for the remarkable answer. Together with the `nomatch=0L` from @Arun it works well. There is a remaining problem though: in the real dataset there are additional columns, all of which are being removed during the rolling join. How can I get all of the remaining columns to enter the joined dataset? – Forzaa Apr 03 '14 at 09:36
@eddi moreover, how can I force the roll to take "endDate > startDate" rather than "endDate >= startDate"? – Forzaa Apr 03 '14 at 12:35
@Forzaa for your first question, use `.SD[.N]` instead of `startDate[.N]` and if that's slow, see [this question](http://stackoverflow.com/q/16573995/817778); as for your second question, just shift either startDate or endDate by a little bit, so that they never match exactly (make sure both are numeric to avoid any weirdness when joining integers and numerics) – eddi Apr 03 '14 at 15:17
@eddi Thanks. Accepted your answer now :) – Forzaa Apr 04 '14 at 11:08

Conditionally remove duplicate elements from a data.table

1 Answers1