Filter rows where "grouping keys" change at least 1 time over time

Question

I have a dataframe with daily observations for individuals. In the example I reduce the dataset to 3 variables, the day and two IDs which identify an individual. An idividual is identified by two IDs but someday one of the IDs could be missing (and re-appear another day), but this is still the same idividual.

For example this is observations for the same idividual :

   Day id1  id2
1 Day1   1    x
2 Day2  NA    x
3 Day3   1 <NA>

Also, one can find the same individuals every day or they can appear/disappear from one day to the other. (For example the id1 = 4 & id2 = d bellow)

I would like to keep the rows where the couple id1, id2 change at least 1 time over days.

I don't know is this is clear enough so here is dummy datas :

datas <- data.frame(id1 = c(1, 2, 3, 1, 2, 3, 4, NA, 2, 3, 4), 
                id2 = c(NA, "b", "c", "a", "b", "c", "d", "a", NA, "c", "d"),
                date = rep(seq(Sys.Date() - 2, Sys.Date(), by = "day"), each = 4)[-4])

   id1  id2       date
1    1 <NA> 2015-08-02
2    2    b 2015-08-02
3    3    c 2015-08-02
4    1    a 2015-08-03
5    2    b 2015-08-03
6    3    c 2015-08-03
7    4    d 2015-08-03
8   NA    a 2015-08-04
9    2 <NA> 2015-08-04
10   3    c 2015-08-04
11   4    d 2015-08-04

And the output I'm expecting :

    id1  id2       date
 1    1 <NA> 2015-08-02
 2    2    b 2015-08-02
 3    1    a 2015-08-03
 4    2    b 2015-08-03
 5   NA    a 2015-08-04
 6    2 <NA> 2015-08-04

I thought I got something with this piece of code :

datas.dt <- data.table(datas, key = c("id1", "id2"))

datas.dt[datas.dt[ , .N, by = c("id1", "id2")][N != length(unique(datas.dt$date))]]

   id1 id2       date N
1:  NA   a 2015-08-04 1
2:   1  NA 2015-08-02 1
3:   1   a 2015-08-03 1
4:   2  NA 2015-08-04 1
5:   2   b 2015-08-02 2
6:   2   b 2015-08-03 2
7:   4   d 2015-08-03 2
8:   4   d 2015-08-04 2

#Or with dplyr

datas.tbl <- tbl_df(datas)

datas.tbl %>% 
  count(id1, id2) %>% 
  filter(n != length(unique(datas.tbl$date))) %>% 
  inner_join(datas.tbl, by = c("id1", "id2"))

But it fails if an individual have no observations for all the days in the dataset (the case id2 = d for example). And for now I have no idea how to achieve this. Any help would be highly appreciated. Thanks !

To clarify: the two id columns are redundant, right? It's not that they are jointly needed to determine identity? — Frank, Aug 04 '15 at 14:17

grrgrrbla · Answer 1 · 2015-08-04T10:22:38.373

4

Is this what you want?

filter_id1 <- setDT(datas)[ , length(unique(id2)), by = id1][ V1 > 1, id1]
filter_id2 <- datas[ , length(unique(id1)), by = id2][ V1 > 1, id2]

datas[ id1 %in% filter_id1 | id2 %in% filter_id2]

output:

   id1 id2       date
1:   1  NA 2015-08-02
2:   2   b 2015-08-02
3:   1   a 2015-08-03
4:   2   b 2015-08-03
5:  NA   a 2015-08-04
6:   2  NA 2015-08-04

edited Aug 04 '15 at 10:22

answered Aug 04 '15 at 10:12

grrgrrbla

2,529
2
16
29

1

Thanks for your answer. This work in this example though I have unexpected behaviour on my real datas, I could have miss something. Anyway you give my key to find a solution : `tmp <- x[, .N, by = c("id1", "id2")] filter_id1 <- tmp[, .N, by = id1][N > 1 & !is.na(id1), id1] filter_id2 <- tmp[, .N, by = id2][N > 1 & !is.na(id2), id2] x[id1 %in% filter_id1 | id2 %in% filter_id2]` – Julien Navarre Aug 04 '15 at 14:58
@JulienNavarre If you're posting code in comments for others' benefit, you can put semicolons at the end of each line so that it's copy-pasteable – Frank Aug 04 '15 at 15:00
happy to have helped you find the solution :), It's always better to find it yourself with some help; the problem in extending my solution to your dataset as a whole is probably in the NA-values which I didnt exclude – grrgrrbla Aug 04 '15 at 15:38

score 4 · Answer 2 · edited May 23 '17 at 11:58

First, clean up the ids by creating a new one:

idmap = unique( na.omit(datas)[,.(id1,id2)] )[,id0 := .I]

datas[idmap, id0 := i.id0, on="id1"]
datas[idmap, id0 := i.id0, on="id2"]

Then select the groups:

datas[
  sort( datas[, .I[ uniqueN(.SD) > 1 ], by=id0, .SDcols=c("id1","id2")]$V1 )
]

This approach is borrowed from @eddi.

The row-number object .I is key here. The column it generates is named V1 by default. After we have these row numbers, we can subset datas in the usual way, datas[row_numbers].

Filter rows where "grouping keys" change at least 1 time over time

2 Answers2