I have a dataframe with daily observations for individuals. In the example I reduce the dataset to 3 variables, the day and two IDs which identify an individual. An idividual is identified by two IDs but someday one of the IDs could be missing (and re-appear another day), but this is still the same idividual.
For example this is observations for the same idividual :
Day id1 id2
1 Day1 1 x
2 Day2 NA x
3 Day3 1 <NA>
Also, one can find the same individuals every day or they can appear/disappear from one day to the other. (For example the id1 = 4 & id2 = d bellow)
I would like to keep the rows where the couple id1, id2 change at least 1 time over days.
I don't know is this is clear enough so here is dummy datas :
datas <- data.frame(id1 = c(1, 2, 3, 1, 2, 3, 4, NA, 2, 3, 4),
id2 = c(NA, "b", "c", "a", "b", "c", "d", "a", NA, "c", "d"),
date = rep(seq(Sys.Date() - 2, Sys.Date(), by = "day"), each = 4)[-4])
id1 id2 date
1 1 <NA> 2015-08-02
2 2 b 2015-08-02
3 3 c 2015-08-02
4 1 a 2015-08-03
5 2 b 2015-08-03
6 3 c 2015-08-03
7 4 d 2015-08-03
8 NA a 2015-08-04
9 2 <NA> 2015-08-04
10 3 c 2015-08-04
11 4 d 2015-08-04
And the output I'm expecting :
id1 id2 date
1 1 <NA> 2015-08-02
2 2 b 2015-08-02
3 1 a 2015-08-03
4 2 b 2015-08-03
5 NA a 2015-08-04
6 2 <NA> 2015-08-04
I thought I got something with this piece of code :
datas.dt <- data.table(datas, key = c("id1", "id2"))
datas.dt[datas.dt[ , .N, by = c("id1", "id2")][N != length(unique(datas.dt$date))]]
id1 id2 date N
1: NA a 2015-08-04 1
2: 1 NA 2015-08-02 1
3: 1 a 2015-08-03 1
4: 2 NA 2015-08-04 1
5: 2 b 2015-08-02 2
6: 2 b 2015-08-03 2
7: 4 d 2015-08-03 2
8: 4 d 2015-08-04 2
#Or with dplyr
datas.tbl <- tbl_df(datas)
datas.tbl %>%
count(id1, id2) %>%
filter(n != length(unique(datas.tbl$date))) %>%
inner_join(datas.tbl, by = c("id1", "id2"))
But it fails if an individual have no observations for all the days in the dataset (the case id2 = d for example). And for now I have no idea how to achieve this. Any help would be highly appreciated. Thanks !