3

I have a dataset that looks like the following:

ID  Cond    Time1   Time2
1   2       Start   Stop1
1   3       Start   abc
1   1       abc     Stop2
1   2       Start   abc
1   2       abc     Stop1
2   2       Start   abc
2   4       abc     jkl
2   3       abc     jkl
2   2       abc     jkl
2   3       abc     Stop2
3   2       Start   abc
3   3       abc     Stop2
3   2       Start   Stop1
3   3       Start   Stop1
3   3       Start   abc
3   2       abc     jkl
3   4       baba    Stop1
4   2       Start   Stop2
4   1       Start   asd
4   2       abc     Stop2

And I need to filter the data based on a couple of criteria. If Cond = 2 and Time1 = Start, and I need to filter until the first stopping point (either Stop1 or Stop2). Essentially, it should look like this:

ID  Cond    Time1   Time2
1   2       Start   Stop1
1   2       Start   abc
1   2       abc     Stop1
2   2       Start   abc
2   4       abc     jkl
2   3       abc     jkl
2   2       abc     jkl
2   3       abc     Stop2
3   2       Start   abc
3   3       abc     Stop2
3   2       Start   Stop1
4   2       Start   Stop2

Also, the real dataset has over 140,000 observations, so efficienicy is key. I was thinking about using the dplyr package, but not sure how to go about this problem.

akash87
  • 3,876
  • 3
  • 14
  • 30

3 Answers3

2

Using dplyr

dframe = read.table(header = T, text = "ID  Cond    Time1   Time2
1   2       Start   Stop1
                    1   3       Start   abc
                    1   1       abc     Stop2
                    1   2       Start   abc
                    1   2       abc     Stop1
                    2   2       Start   abc
                    2   4       abc     jkl
                    2   3       abc     jkl
                    2   2       abc     jkl
                    2   3       abc     Stop2
                    3   2       Start   abc
                    3   3       abc     Stop2
                    3   2       Start   Stop1
                    3   3       Start   Stop1
                    3   3       Start   abc
                    3   2       abc     jkl
                    3   4       baba    Stop1
                    4   2       Start   Stop2
                    4   1       Start   asd
                    4   2       abc     Stop2")

library(dplyr)

# add index
dframe = data.frame(index = 1:nrow(dframe), dframe)
head(dframe)

# get starting points
start_points = dframe %>%
  filter(Cond == 2 & Time1 == 'Start') %>%
  select(index, ID)

# get stopping points
stop_points = dframe %>%
  filter(substr(Time2, 1, 4) == 'Stop') %>%
  select(index, ID)

# get the stopping point associated with each start point
start_stop = start_points %>%
  left_join(stop_points, by = "ID") %>%
  filter(index.x <= index.y) %>%
  group_by(ID, index.x) %>%
  summarise(index.y = min(index.y)) %>%
  ungroup() %>%
  rename(start_index = index.x, stop_index = index.y)

# add rows between
result = start_stop %>%
  left_join(dframe, by = "ID") %>%
  filter(start_index <= index, index <= stop_index) %>%
  select(-c(start_index, stop_index, index))

> result
Source: local data frame [12 x 4]

ID  Cond  Time1  Time2
(int) (int) (fctr) (fctr)
1      1     2  Start  Stop1
2      1     2  Start    abc
3      1     2    abc  Stop1
4      2     2  Start    abc
5      2     4    abc    jkl
6      2     3    abc    jkl
7      2     2    abc    jkl
8      2     3    abc  Stop2
9      3     2  Start    abc
10     3     3    abc  Stop2
11     3     2  Start  Stop1
12     4     2  Start  Stop2
mbiron
  • 3,933
  • 1
  • 14
  • 16
2

Another data.table solution:

library(data.table)
setDT(DF)
DF[,     s0 := cumsum(Cond==2 & Time1 == "Start")]
DF[.N:1, s1 := cumsum(Time2 %like% "Stop")]

DF[, .SD[ s1 == s1[1L] ], by=s0]

    s0 ID Cond Time1 Time2 s1
 1:  1  1    2 Start Stop1 10
 2:  2  1    2 Start   abc  8
 3:  2  1    2   abc Stop1  8
 4:  3  2    2 Start   abc  7
 5:  3  2    4   abc   jkl  7
 6:  3  2    3   abc   jkl  7
 7:  3  2    2   abc   jkl  7
 8:  3  2    3   abc Stop2  7
 9:  4  3    2 Start   abc  6
10:  4  3    3   abc Stop2  6
11:  5  3    2 Start Stop1  5
12:  6  4    2 Start Stop2  2

.SD is the Subset of Data associated with each by=s0 group. The .N:1 in the second line reverses the data temporarily to create s1. If you don't want to keep the new columns around, they can be removed like DF[, s0 := NULL][, s1 := NULL] or DF[, c("s0", "s1") := NULL].

If the last line is slow, it is worth trying @eddi's approach:

DF[DF[, .I[ s1 == s1[1L] ], by=s0]$V1]
Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Good use of `cumsum` – akrun Jun 21 '16 at 18:15
  • @Frank I realized this includes all other conditions as well when I apply it to my dataset – akash87 Jun 24 '16 at 17:57
  • @akash87 I am not sure what you mean without additional context. If you mean that start and stop must always both appear in a run of Cond==2... I guess your example and initial description did not cover it, yeah. – Frank Jun 24 '16 at 19:52
  • 1
    @Frank I meant that it must start with a `Cond == 2` and end at the `Stop1` or `Stop2`. I think I misread your code, but it works now! – akash87 Jun 25 '16 at 23:01
1

You can use Map to conditionally construct the series of rows you want to select, where an anonymous function can be used to judge if the start time has condition 2. Here is a solution, where we used the data.table for syntax sugar:

library(data.table)
setDT(df)
df[unlist(Map(function(t1, t2) if(t1 %in% which(Cond == 2)) t1:t2 else NULL, 
              which(Time1 == "Start"), which(grepl("Stop", Time2))))]
    ID Cond Time1 Time2
 1:  1    2 Start Stop1
 2:  1    2 Start   abc
 3:  1    2   abc Stop1
 4:  2    2 Start   abc
 5:  2    4   abc   jkl
 6:  2    3   abc   jkl
 7:  2    2   abc   jkl
 8:  2    3   abc Stop2
 9:  3    2 Start   abc
10:  3    3   abc Stop2
11:  3    2 Start Stop1
12:  4    2 Start Stop2

To boost the performance a little bit:

df[{ cond2 = which(Cond == 2); 
     unlist(Map(function(t1, t2) if(t1 %in% cond2) t1:t2 else NULL, 
                which(Time1 == "Start"), which(grepl("Stop", Time2)))) }]
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • in your code, why do you have `t1:t2`? I am not as familiar with the `data.table` package. – akash87 Jun 21 '16 at 15:03
  • `t1:t2` will construct a vector of rows from where `Time1 == "Start"` to where `Time2 == "Stop1/2"` but conditionally which happens only when the `Start` has condition `2` which I think is what you mean. `data.table` here is mainly to save some typing, probably will have some performance boost. This should also work in base `R`. – Psidom Jun 21 '16 at 15:07
  • Interesting! Now I have read that `Map` is a wrapper to mapply, Do you use `map` because of how `data.table` reads a data frame? – akash87 Jun 21 '16 at 15:11
  • 1
    No I use `Map` because it has only three letters. :) – Psidom Jun 21 '16 at 15:13