Anonymous function with dplyr::do - using results from rle to filter data

Question

I have time series data grouped by subject ('id'), which stay on certain 'site' and have a certain 'stage' in each 'time' step.

Sometimes subjects switch from one site to another, and possibly back again. If individuals switch site back and forth (e.g. from site 'a' to site 'b', and then back to site 'a') and if there is only one registration on the middle site (in a transition a-b-a, then site 'b' would here be considered a 'middle site') and the individual is in a certain stage (here, stage = 2) at the middle site, then I wish to remove the registration at this time step.

My dummy data consists of four subjects. Three of them (subject 1-3) have moved from site a to b, and then back to site b, and one has moved from a to b.

The first two subjects both have a single registration on the middle site. Subject 1 is in stage 1 on the middle site and I wish to keep that registration. Subject 2 on the other hand is in stage 2 on the middle site and this registration should be removed. Subject 3, has also moved back and forth between a and b. However, although it is in stage 2 on the middle site b, it has two registrations there and both registrations are kept. Subject 4 has moved from site a to b, but not back again. Thus, although it is in stage 2 on site b, the registration on site b is not a 'middle site' and should be kept.

The data:

df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
                     time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
                     site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
                     stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
                     .Names = c("id", "time", "site", "stage"),
                row.names = c(NA, -12L), class = "data.frame")

df

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1 <~~ A single middle registration on site 2
# 3   1    3    a     1     However, the individual is in stage 1: -> keep 

# 4   2    1    a     1
# 5   2    2    b     2 <~~ A single middle registration on site 2 with stage 2: -> remove
# 6   2    3    a     1

# 7   3    1    a     1
# 8   3    2    b     2 <~~ Two middle registrations with stage 2: -> keep both rows 
# 9   3    3    b     2 <~~
# 10  3    4    a     1

# 11  4    1    a     1 
# 12  4    2    b     2 <~~ A single registration on site 2 with stage 2,
#                            but it is not in between two sites: -> keep

Thus, in the test data, it is only the registration at time = 2 for id = 2 which should be removed.

Previously, I have used plyr::ddply and result from rle to solve the problem:

For each individual, calculate run lengths of site (rle(x$site))
If:
- back and forth between sites (e.g. from a to b, and back to a) (length(r$values) > 2) &
- only one registration on middle site (r$lengths[2] == 1) &
- stage on middle site is 2 (x$stage[x$site == r$values[2]][1] == 2)
Then: remove registration on middle site x[!(x$site == r$values[2]), ])

library(plyr)

ddply(df, .(id), function(x){
  r <- rle(x$site)
  if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
    x[x$site != r$values[2], ]
  } else x
})

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1

# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 is removed
# 5   2    3    a     1 <~~

# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1

# 10  4    1    a     1
# 11  4    2    b     2

detach("package:plyr")

Now I have some trouble getting this right in dplyr. I found some relevant posts on SO (e.g. this and this), and on github (this and this), but I have trouble to adapt them to my needs. Here are some desperate attempts:

library(dplyr)

df %>%
  group_by(id) %>%
  do((function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
    filter(x, x$site != r$values[2])
  } else x
})(.))
# desired row is not removed

df %>%
  group_by(id) %>%
  do(function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
      x[!(x$site == r$values[2]), ]
    } else x
  })
# Error: Results are not data frames at positions: 1, 2, 3

This attempt happens to work (gives same result as ddply above), but is very far from elegant, and I doubt it's 'the right way':

df %>%
  group_by(id) %>%
  do(r = rle(.$site)) %>%  
  do(data.frame(id = .$id,
                len = length(.$r$values),
                site = .$r$values[2],
                len2 = .$r$lengths[2])) %>%
  filter(len == 3, len2 == 1) %>%
  select(-len) %>%
  left_join(df, ., by = c("id", "site")) %>%
  filter(!(len2 %in% 1 & stage == 2)) %>%
  select(-len2)

How to do this properly? WWHWD?

I wanted to benchmark the approaches but noticed that your last approach doesn't work (at least on my machine) when I use your input df. The error is in the `do(r = rle(.$site))` and reads: "Error in rle(.$site) : 'x' must be a vector of an atomic type". — talat, Nov 28 '14 at 16:37
@beginneR, It seems like you have 'site' as `factor`, not as `character` (I have `stringsAsFactors = FALSE` in my `option`). I `dput` df instead. I bet your method is faster than my convoluted attempt... — Henrik, Nov 30 '14 at 18:03

talat · Accepted Answer · 2014-12-01T13:32:49.473

3

I'm not sure whether I fully understood the logic behind your code, but this might be another way to get to the same result, perhaps with some modifications if needed:

df %>% 
  group_by(id) %>%
  group_by(grp = cumsum(abs(c(1, diff(as.numeric(site))))), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

#Source: local data frame [9 x 5]
#Groups: id, grp
#
#  id time site stage grp
#1  1    1    a     1   1
#2  1    2    b     1   2
#3  1    3    a     1   3
#4  2    1    a     1   1     <~~ row in between 
#5  2    3    a     1   3     <~~ was removed
#6  3    1    a     1   1
#7  3    2    b     2   2
#8  3    3    b     2   2
#9  3    4    a     1   3

This approach assumes that the "middle group" is always the second "grp".

Might be even nicer to create a function - which I'm gonna call intergroup() since it creates groups inside the grouped data, and use that:

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

df %>% 
  group_by(id) %>%
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

Edit after OP question update.

Try the following adjusted code for the adjusted problem:

df %>% 
  group_by(id) %>%
  mutate(z = lag(site, 1) != lead(site, 1)) %>%   # check if site before and after are not the same
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2 & !is.na(z))) %>%  # check for NA in z
  ungroup() %>% select(-c(z, grp))  

#Source: local data frame [11 x 4]
#
#   id time site stage
#1   1    1    a     1
#2   1    2    b     1
#3   1    3    a     1
#4   2    1    a     1
#5   2    3    a     1
#6   3    1    a     1
#7   3    2    b     2
#8   3    3    b     2
#9   3    4    a     1
#10  4    1    a     1
#11  4    2    b     2    <~~ row is kept

edited Dec 01 '14 at 13:32

answered Nov 28 '14 at 09:32

talat

68,970
21
126
157

Thanks a lot for your response. The `cumsum` grouping variable was definitely more elegant than my `rle` version (+1). – Henrik Nov 28 '14 at 12:17
@Henrik, sure thing, thanks for letting me know. Btw, maybe [this answer](http://stackoverflow.com/a/27202707/3521006) by akrun is also interesting for you, since he uses rle in a dplyr chain, not in `do`, though. – talat Nov 30 '14 at 22:52
In my post I wrote "_Sometimes_ subjects switch from one site to another, and _possibly_ back again". It was a mistake from my side not to include a test case with an individual which _not_ moves back and forth from the start. I have updated my data (added id 4). I notice that your answer needs to be adjusted to handle such 'non-middle' cases. Sorry for not providing this case from the very start. – Henrik Dec 01 '14 at 12:45
@Henrik, it's a little unclear how exactly those middle cases are defined. A simple solution for the adjusted case might be `filter(!(grp == 2 & n() == 1 & stage == 2) | n_distinct(grp) <= 2)` but I'm not sure whether that is enough for your requirements (e.g. when there is no "middle case": does that mean there are only 1 or 2 different grp entries at maximum or how is the definition?) – talat Dec 01 '14 at 12:50
@Henrik, I updated my answer. Does that fulfill the requirements? It's based on the fact that you only want to remove rows where there is only a single middle case row and, hence, you can just check the lagging and leading rows for equality. – talat Dec 01 '14 at 13:35
Thanks again for taking your time. I learned several nice tricks from your answer. Very helpful indeed. – Henrik Dec 01 '14 at 13:37

score 2 · Answer 2 · edited May 23 '17 at 10:28

Here's an rle alternative which doesn't rely on do. The code was inspired by this answer by @akrun (posted right after my question; thanks to @beginneR for the heads up).

df %>%
  group_by(id) %>%
  mutate(site_idx = with(rle(site),
                           rep(x = seq_along(lengths), times = lengths))) %>%
  filter(!(n_distinct(site_idx) > 2 & sum(site_idx == 2) == 1 &
           site_idx == 2 & stage == 2)) %>%
  select(-site_idx)

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1
# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 has been removed
# 5   2    3    a     1 <~~
# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1
# 10  4    1    a     1
# 11  4    2    b     2

I think you have one `group_by` too many - or what is the second supposed to do? (I think you only need mutate here and could then remove the third, group_by, no?) — talat, Dec 01 '14 at 15:17

Anonymous function with dplyr::do - using results from rle to filter data

2 Answers2

Linked