3

I have an data file with a time and output column. The output column consists of values 1 and 2. For each run of the output column where it takes the value 2, I want to compute the total time elapsed during the run, namely the end time minus the start time. For example:

    time          output       total
      2                2           4-2=2
      4                2
      6                1
      8                2           10-8=2
      10               2
      12               1
      14               1
      16               1
      18               2           22-18=4
      20               2
      22               2

Is there some easy way to so this for large data frames?

josliber
  • 43,891
  • 12
  • 98
  • 133
amy
  • 65
  • 6

3 Answers3

0

It sounds like you want the time elapsed in each run of output variable where that variable is equal to 2.

One approach would be to use dplyr to group by runs, filter to runs of output type 2, and then compute the elapsed time:

library(dplyr)
dat %>%
  group_by(run={x = rle(output) ; rep(seq_along(x$lengths), x$lengths)}) %>%
  filter(output == 2) %>%
  summarize(total=max(time)-min(time))
# Source: local data frame [3 x 2]
# 
#     run total
#   (int) (dbl)
# 1     1     2
# 2     3     2
# 3     5     4

This could also be done in base R using the rle function:

x <- rle(dat$output)
unname(tapply(dat$time, rep(seq_along(x$lengths), x$lengths), function(x) max(x)-min(x))[x$values == 2])
# [1] 2 2 4
Community
  • 1
  • 1
josliber
  • 43,891
  • 12
  • 98
  • 133
0

Here is another way. I created a group variable called foo with rleid(). For each group, I subtracted the first output value from the last output value, which is total. Then, I replaced all values in total with NA where output is not 2. Then, for each group, I assigned the a vector including the first value of total and NAs. Finally, I dropped the group variable.

library(data.table)

mydf <- data.frame(time = c(2,4,6,8,10,12,14,16,18,20,22),
                   output = c(2,2,1,2,2,1,1,1,2,2,2))

setDT(mydf)[, foo := rleid(output)][,
    total := last(time) - first(time), by = "foo"][,
    total := replace(total, which(output !=2), NA)][,
    total := c(total[1L], rep(NA, .N - 1)), by = "foo"][, -3, with = FALSE][]

#    time output total
# 1:    2      2     2
# 2:    4      2    NA
# 3:    6      1    NA
# 4:    8      2     2
# 5:   10      2    NA
# 6:   12      1    NA
# 7:   14      1    NA
# 8:   16      1    NA
# 9:   18      2     4
#10:   20      2    NA
#11:   22      2    NA
jazzurro
  • 23,179
  • 35
  • 66
  • 76
0

I understood that you wanted to group by 'runs' of outlook ?

First we need to index the 'runs'. I created a function based on rle (I couldn't find anything to do this but it probably already exists).

indexer <-function(x){
  run <- rle(x)$length
  size <- length(run)
  value <- c()
  for(i in 1:size){
    value = c(value, rep(i,run[i]))
    }
  value
  }

df$index <- indexer(df$output)

df %>% group_by(index) %>% mutate(total = max(time) - min(time))

    time output index total
1      2      2     1     2
2      4      2     1     2
3      6      1     2     0
4      8      2     3     2
5     10      2     3     2
6     12      1     4     4
7     14      1     4     4
8     16      1     4     4
9     18      2     5     4
10    20      2     5     4
11    22      2     5     4
Matthieu P.
  • 131
  • 4