
I'm trying to use dplyr to take the first and last rows of repeated values by group. I'm doing this for efficiency reasons, particularly so that graphing is faster.

This is not a duplicate of Select first and last row from grouped data because I'm not asking for the strict first and last row in a group; I'm asking for the first and last row in a group by level (in my case 1's and 0's) that may appear in multiple chunks.

Here's an example. Say I want to remove all the redundant 1's and 0's from column C while keeping A and B intact.

df = data.frame(
    A = rep(c("a", "b"), each = 10),
    B = rep(c(1:10), 2),
    C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1))

A  B C
a  1 1
a  2 0
a  3 0
a  4 0
a  5 0
a  6 0
a  7 1
a  8 1
a  9 1
a 10 1
b  1 0
b  2 0
b  3 0
b  4 1
b  5 0
b  6 0
b  7 0
b  8 0
b  9 0
b 10 1

The end result should look like this:

A  B C
a  1 1
a  2 0
a  6 0
a  7 1
a 10 1
b  1 0
b  3 0
b  4 1
b  5 0
b  9 0
b 10 1

Using `unique` will either remove nothing or keep just one of the 1's or 0's, without retaining the start-and-end structure I'm after. Is there a way to do this without a loop, perhaps using dplyr or forcats?

Nancy
  • @zx8754 I edited my question to detail what I see as differences between what I'm asking and the question you linked to. They get at different outcomes that, for example, have useful ramifications for longitudinal cohort data. – Nancy Mar 30 '17 at 22:27
  • 1
    In base R the unwieldy `df[unique(unlist(tapply(seq_along(df$C), list(df$A, cumsum(c(1, abs(diff(df$C))))), FUN=function(i) c(head(i,1), tail(i, 1))))),]` works for the example. – lmo Mar 31 '17 at 14:22

2 Answers


I think that slice should get you close:

df %>%
  group_by(A,C) %>%
  slice(c(1, n()))

gives

      A     B     C
  <chr> <int> <dbl>
1     a     2     0
2     a     6     0
3     a     1     1
4     a    10     1
5     b     1     0
6     b     9     0
7     b     4     1
8     b    10     1

though this doesn't quite match your expected output. `slice(c(1, n()))` takes the first and last row of each `A`/`C` group (`n()` gives the group size, i.e. the index of the last row), but grouping by `C` lumps all of the separate runs of 1's and 0's together, so the boundaries between runs are lost.

After your edit it is clear that you are not looking for the first and last rows within each established group (which is what the version above gives). You want to group by the runs of consecutive 1's or 0's. For that, create a column that flags whether the run has changed relative to the previous row, then take its cumulative sum to get a run ID. With that grouping, `slice` works as described before. However, because some of your runs are only one row long, `n()` should only be included when the group has more than one row (otherwise that single row shows up twice).

df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]))
         , toCutBy = cumsum(groupChanged)
         ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))

Gives

       A     B     C groupChanged toCutBy
   <chr> <int> <dbl>        <lgl>   <int>
1      a     1     1        FALSE       0
2      a     2     0         TRUE       1
3      a     6     0        FALSE       1
4      a     7     1         TRUE       2
5      a    10     1        FALSE       2
6      b     1     0         TRUE       3
7      b     3     0        FALSE       3
8      b     4     1         TRUE       4
9      b     5     0         TRUE       5
10     b     9     0        FALSE       5
11     b    10     1         TRUE       6

If the runs of 1's or 0's must also stay within the levels of column A, add a check for a change in column A as well. In this example it has no effect (the output is identical), but it may matter in other data.

df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]) |
                           A != lag(A, default = A[1]))
         , toCutBy = cumsum(groupChanged)
  ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))
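As a side note (my own sketch, not part of the answer above): the run-ID column can equivalently be built with base R's `rle()`, and deduplicating the slice indices with `unique()` avoids the `ifelse(n() == 1, NA, n())` trick for one-row runs.

```r
library(dplyr)

df <- data.frame(
  A = rep(c("a", "b"), each = 10),
  B = rep(1:10, 2),
  C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1))

# rle() collapses consecutive repeats; repeating seq_along(lengths) by
# lengths rebuilds a run ID equivalent to cumsum(groupChanged) above.
# Pasting A and C together also restarts runs at the A boundary.
run_id <- with(rle(paste(df$A, df$C)),
               rep(seq_along(lengths), lengths))

df %>%
  mutate(run = run_id) %>%
  group_by(run) %>%
  slice(unique(c(1, n()))) %>%   # unique() collapses one-row runs to a single index
  ungroup() %>%
  select(A, B, C)
```

Because `unique(c(1, n()))` is just `1` when the run has a single row, no `NA` handling is needed.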
Mark Peterson
  • Interesting, that's better than what I've been doing. It's still missing a few rows... like a-7, b-3, and b-5. Do you know why that is? – Nancy Mar 30 '17 at 19:58
  • I'm not sure why this question/answer attracted such a downvote brigade. Upvoted for useful contribution. – Nancy Mar 30 '17 at 22:08
  • I noticed that you didn't do group_by(A); did you intend to leave it out or is it superfluous? – Nancy Mar 31 '17 at 17:29
  • Your question only specified that column `C` controlled the groups -- I've added an approach that also checks if `A` changed (gives the same output in this example) – Mark Peterson Mar 31 '17 at 17:40
  • Amazing, thanks. I tried to provide an illustrative example but it was difficult to make a minimal/reproducible example that was also generalizable to future users. – Nancy Mar 31 '17 at 17:42

One solution:

C_filter <- function(x) {
    # Keep a row if it differs from the neighbour on either side.
    # At the boundaries, x[0] is zero-length and x[length(x) + 1] is NA,
    # so identical() returns FALSE and the first/last rows are kept.
    !sapply(seq_along(x), function(i) identical(x[i], x[i - 1])) |
        !sapply(seq_along(x), function(i) identical(x[i], x[i + 1]))
}
df %>% group_by(A) %>% filter(C_filter(C))

   A  B C
1  a  1 1
2  a  2 0
3  a  6 0
4  a  7 1
5  a 10 1
6  b  1 0
7  b  3 0
8  b  4 1
9  b  5 0
10 b  9 0
11 b 10 1
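The same keep-if-different-from-either-neighbour filter can be written without `sapply` using dplyr's `lag()` and `lead()`. This vectorized variant is my own sketch; it assumes `C` never takes the placeholder default `-1`, which is safe here since `C` is 0/1.

```r
library(dplyr)

df <- data.frame(
  A = rep(c("a", "b"), each = 10),
  B = rep(1:10, 2),
  C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1))

# Keep a row when it differs from either neighbour within its A group.
# The -1 defaults make the first and last row of each group compare
# unequal to their missing neighbour, so run endpoints are always kept.
df %>%
  group_by(A) %>%
  filter(C != lag(C, default = -1) | C != lead(C, default = -1)) %>%
  ungroup()
```

Under `group_by(A)`, `lag()` and `lead()` operate within each group, so runs never leak across the `a`/`b` boundary.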
thc