I have time series data grouped by subject ('id'). At each 'time' step, a subject stays on a certain 'site' and is in a certain 'stage'.
Sometimes subjects switch from one site to another, and possibly back again. I wish to remove a registration when all of the following hold: the individual switches site back and forth (e.g. from site 'a' to site 'b', and then back to site 'a', which makes 'b' the 'middle site' of the transition a-b-a), there is only one registration on the middle site, and the individual is in a certain stage (here, stage = 2) at the middle site.
My dummy data consists of four subjects. Three of them (subjects 1-3) have moved from site a to b, and then back to site a, and one (subject 4) has moved from a to b.
The first two subjects both have a single registration on the middle site. Subject 1 is in stage 1 on the middle site and I wish to keep that registration. Subject 2, on the other hand, is in stage 2 on the middle site and this registration should be removed. Subject 3 has also moved back and forth between a and b. However, although it is in stage 2 on the middle site b, it has two registrations there and both registrations are kept. Subject 4 has moved from site a to b, but not back again. Thus, although it is in stage 2 on site b, site b is not a 'middle site' and the registration should be kept.
The data:
df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
                     time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
                     site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
                     stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
                .Names = c("id", "time", "site", "stage"),
                row.names = c(NA, -12L), class = "data.frame")
df
#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1  <~~ A single middle registration on site b
# 3   1    3    a     1      However, the individual is in stage 1: -> keep
# 4   2    1    a     1
# 5   2    2    b     2  <~~ A single middle registration on site b with stage 2: -> remove
# 6   2    3    a     1
# 7   3    1    a     1
# 8   3    2    b     2  <~~ Two middle registrations with stage 2: -> keep both rows
# 9   3    3    b     2  <~~
# 10  3    4    a     1
# 11  4    1    a     1
# 12  4    2    b     2  <~~ A single registration on site b with stage 2,
#                            but it is not in between two sites: -> keep
Thus, in the test data, it is only the registration at time = 2 for id = 2 which should be removed.
Previously, I have used plyr::ddply and the result from rle to solve the problem:
For each individual, calculate run lengths of site (rle(x$site))
If:
- back and forth between sites (e.g. from a to b, and back to a) (length(r$values) > 2) &
- only one registration on middle site (r$lengths[2] == 1) &
- stage on middle site is 2 (x$stage[x$site == r$values[2]][1] == 2)
Then: remove registration on middle site (x[!(x$site == r$values[2]), ])
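To make the run-length logic above concrete, here is a small base R sketch of what rle returns for the site sequences of subjects 2 and 3. The vectors are typed in by hand here for illustration rather than extracted from df.

```r
# Site sequence of subject 2 in the dummy data: a -> b -> a
r2 <- rle(c("a", "b", "a"))
r2$values    # "a" "b" "a"
r2$lengths   # 1 1 1

# More than two runs, and the second (middle) run has length 1
single_middle_2 <- length(r2$values) > 2 && r2$lengths[2] == 1
single_middle_2   # TRUE

# Site sequence of subject 3: a -> b -> b -> a
r3 <- rle(c("a", "b", "b", "a"))
r3$lengths   # 1 2 1

# The middle run has length 2, so this is not a single middle registration
single_middle_3 <- length(r3$values) > 2 && r3$lengths[2] == 1
single_middle_3   # FALSE
```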
library(plyr)
ddply(df, .(id), function(x){
  r <- rle(x$site)
  if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
    x[x$site != r$values[2], ]
  } else x
})
#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1
# 4   2    1    a     1  <~~ the single middle site with stage = 2 at time 2 is removed
# 5   2    3    a     1  <~~
# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1
# 10  4    1    a     1
# 11  4    2    b     2
detach("package:plyr")
Now I have some trouble getting this right in dplyr. I found some relevant posts on SO (e.g. this and this), and on GitHub (this and this), but I have trouble adapting them to my needs. Here are some desperate attempts:
library(dplyr)
df %>%
  group_by(id) %>%
  do((function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
      filter(x, x$site != r$values[2])
    } else x
  })(.))
# desired row is not removed
df %>%
  group_by(id) %>%
  do(function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
      x[!(x$site == r$values[2]), ]
    } else x
  })
# Error: Results are not data frames at positions: 1, 2, 3
This attempt happens to work (gives the same result as ddply above), but is very far from elegant, and I doubt it's 'the right way':
df %>%
  group_by(id) %>%
  do(r = rle(.$site)) %>%
  do(data.frame(id = .$id,
                len = length(.$r$values),
                site = .$r$values[2],
                len2 = .$r$lengths[2])) %>%
  filter(len == 3, len2 == 1) %>%
  select(-len) %>%
  left_join(df, ., by = c("id", "site")) %>%
  filter(!(len2 %in% 1 & stage == 2)) %>%
  select(-len2)
How do I do this properly? WWHWD?
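One direction that might fit dplyr better, offered only as a sketch and not as 'the right way': move the rle logic into a helper that returns a logical keep-vector for one subject, and call it inside a grouped filter. The name keep_row is hypothetical (it is not part of dplyr or any other package), and the helper deliberately mirrors the ddply version above by dropping every row on the middle site's value.

```r
library(dplyr)

# Dummy data from the question
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  time  = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
  site  = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
  stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for the rows of ONE subject, return TRUE for rows to keep
keep_row <- function(site, stage) {
  r <- rle(site)
  # && short-circuits, so r$lengths[2] is never tested when there are < 3 runs
  drop_middle <- length(r$values) > 2 &&
    r$lengths[2] == 1 &&
    stage[site == r$values[2]][1] == 2
  if (drop_middle) site != r$values[2] else rep(TRUE, length(site))
}

res <- df %>%
  group_by(id) %>%
  filter(keep_row(site, stage)) %>%
  ungroup()

nrow(res)   # 11: only the row id = 2, time = 2 is dropped
```

Because the helper uses x-level vectors passed in by filter (site and stage of the current group) rather than df$stage, the stage test refers to the current subject only.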