1

I have a large dataset in which one columns contains lots of sequential runs of different lengths (e.g. 1,2,1,2,3,1,2,1,2,3,4,5). A lot of these runs are only 1,2 - then start from 1 again.

I would like to somehow group/split my data into all the rows contained within one of those runs. Let's say for simplicity's sake that my data looks something like this (I have updated column a to contain the first 15 values from my actual data to illustrate the problem):

    df <- data.frame(a = c(1:2, 1:5, 1:2, 1:4, 1:2), b = c(LETTERS[1:15]), c = c(1:15))

 a b  c
1  1 A  1
2  2 B  2
3  1 C  3
4  2 D  4
5  3 E  5
6  4 F  6
7  5 G  7
8  1 H  8
9  2 I  9
10 1 J 10
11 2 K 11
12 3 L 12
13 4 M 13
14 1 N 14
15 2 O 15

I want to get either a group (or a new columns with a conditionally assigned value that I can use to group/split) containing the first run in df[,1] (consisting of rows 1, 2 and 3), one containing the second run (rows 4, 5, 6), and so on.

I have tried to do what was suggested in a answer to the unedited version of this question test <- group_by(df, a) %>% mutate(run = seq(length(a)))

But the result is somewhat garbled, which I suspect originates from the fact that a lot of the runs are just going up to 2.

Source: local data frame [15 x 4]
Groups: a [5]

       a      b     c   run
   <int> <fctr> <int> <int>
1      1      A     1     1
2      2      B     2     1
3      1      C     3     2
4      2      D     4     2
5      3      E     5     1
6      4      F     6     1
7      5      G     7     1
8      1      H     8     3
9      2      I     9     3
10     1      J    10     4
11     2      K    11     4
12     3      L    12     2
13     4      M    13     2
14     1      N    14     5
15     2      O    15     5

Edit: The solution

    split(df, cumsum(c(TRUE, diff(df$a)!=1)))

works great.

  • 3
    http://stackoverflow.com/questions/25793981/split-a-vector-by-its-sequences – user20650 Oct 11 '16 at 21:08
  • Thank you for pointing me towards that question. However, both of the answers to it are based on the assumption that all runs are longer that 1:2 and the point of difference is always going to be a difference between following values of >1. This works on the df in my example, but not for the description of my problem (apologies for that), since a lot of my runs only go up to 2. These would be grouped with the following run, so it does not solve my problem. (I have tried the first solution suggested in the other answer, and can't get it to do even that, so maybe I am missing something?) – Gertje Petersen Oct 11 '16 at 21:50
  • If you can edit your question so that it's clear that the other solutions don't work in this case we can reopen. – Gregor Thomas Oct 11 '16 at 21:51
  • 1
    For your (original) example, the translation of the first answer on the dupe would be `split(df, cumsum(c(TRUE, diff(df$a)!=1)))`. To my eye, it works just fine on runs of 1, 2 or more. – Gregor Thomas Oct 11 '16 at 21:54
  • If you replace the `a` definition in your example with `a = c(1, 1, 2, 1, 2, 3, 7, 8, 9)` and run the code from my above comment, you can see a nice example. – Gregor Thomas Oct 11 '16 at 21:57
  • Hi @GertjePetersen ; if you are still having problems with your data, after Gregor's comments, can you edit your question with a small vector that shows the issue. thanks – user20650 Oct 11 '16 at 22:00
  • for some reason, I couldn't get it to work when I first tried it, but it did just run beautifully on my original data. - Thanks, @Gregor – Gertje Petersen Oct 11 '16 at 22:09

1 Answers1

0

The following code creates a run variable as you intend.

df <- data.frame(a = c(1,2,3), b = c(LETTERS[1:9]), c = c(1:9))

library(dplyr)
df <- group_by(df, a) %>% mutate(run = seq(length(a)))
Kota Mori
  • 6,510
  • 1
  • 21
  • 25