11

Here's a sample of booleans I have as part of a data.frame:

atest <- c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

I want to return a sequence of numbers starting at 1 from each FALSE and increasing by 1 until the next FALSE.

The resulting desired vector is:

[1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1

Here's the code that accomplishes this, but I'm sure there's a simpler or more elegant way to do this in R. I'm always trying to learn how to code things more efficiently in R rather than simply getting the job done.

result <- c()
x <- 1
for(i in 1:length(atest)){
    if(atest[i] == FALSE){
        result[i] <- 1
        x <- 1
    } 
    if(atest[i] != FALSE){
        x <- x+1
         result[i] <- x
    }
}
tcash21
  • 4,880
  • 4
  • 32
  • 39
  • 1
    Re-allocating ("growing") an object in a for loop is a big no-no in R. It's about the slowest thing you can do. – Joshua Ulrich Jul 23 '13 at 20:59
  • I know I tried with an sapply but just wanted to get the basic logic out. Your solution is exactly what I was looking for. – tcash21 Jul 23 '13 at 21:15

3 Answers3

19

Here's one way to do it, using handy (but not widely-known/used) base functions:

> sequence(tabulate(cumsum(!atest)))
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1

To break it down:

> # return/repeat integer for each FALSE
> cumsum(!atest)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
> # count the number of occurrences of each integer
> tabulate(cumsum(!atest))
[1] 10 10  1
> # create concatenated seq_len for each integer
> sequence(tabulate(cumsum(!atest)))
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
  • 1
    I already +1'd, but I'd do it again as the explanation is really helpful! – Thomas Jul 23 '13 at 20:58
  • 2
    @Joshua Ulrich +1 for this great solution; but it fails if the first element isn't `FALSE`: `sequence(tabulate(cumsum(!atest[-1])))` – sgibb Jul 23 '13 at 20:59
  • 1
    @sgibb: I didn't try the OP's code before I answered, but I see it starts the first sequence at 2 if the first element isn't `FALSE`. That seems inconsistent with their text, "I want to return a sequence of numbers starting at 1 from each FALSE and increasing by 1 until the next FALSE." – Joshua Ulrich Jul 23 '13 at 21:08
  • 1
    This is awesome. My data will always start with a FALSE. I've never used tabulate or sequence, only seq. Thanks so much! – tcash21 Jul 23 '13 at 21:15
5

Here is another approach using other familiar functions:

seq_along(atest) - cummax(seq_along(atest) * !atest) + 1L

Because it is all vectorized, it is noticeably faster than @Joshua's solution (if speed is of any concern):

f0 <- function(x) sequence(tabulate(cumsum(!x)))
f1 <- function(x) {i <- seq_along(x); i - cummax(i * !x) + 1L}
x  <- rep(atest, 10000)

library(microbenchmark)
microbenchmark(f0(x), f1(x))
# Unit: milliseconds
#   expr       min        lq    median        uq      max neval
#  f0(x) 19.386581 21.853194 24.511783 26.703705 57.20482   100
#  f1(x)  3.518581  3.976605  5.962534  7.763618 35.95388   100

identical(f0(x), f1(x))
# [1] TRUE
flodel
  • 87,577
  • 21
  • 185
  • 223
2

Problems like these tend to work well with Rcpp. Borrowing @flodel's code as a framework for benchmarking,

boolseq.cpp
-----------

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector boolSeq(LogicalVector x) {
  int n = x.length();
  IntegerVector output = no_init(n);
  int counter = 1;
  for (int i=0; i < n; ++i) {
    if (!x[i]) {
      counter = 1;
    }
    output[i] = counter;
    ++counter;
  }
  return output;
}

/*** R
x <- c(FALSE, sample( c(FALSE, TRUE), 1E5, TRUE ))

f0 <- function(x) sequence(tabulate(cumsum(!x)))
f1 <- function(x) {i <- seq_along(x); i - cummax(i * !x) + 1L}

library(microbenchmark)
microbenchmark(f0(x), f1(x), boolSeq(x), times=100)

stopifnot(identical(f0(x), f1(x)))
stopifnot(identical(f1(x), boolSeq(x)))
*/

sourceCpping it gives me:

Unit: microseconds
       expr       min        lq     median         uq       max neval
      f0(x) 18174.348 22163.383 24109.5820 29668.1150 78144.411   100
      f1(x)  1498.871  1603.552  2251.3610  2392.1670  2682.078   100
 boolSeq(x)   388.288   426.034   518.2875   571.4235   699.710   100

Less elegant, but pretty darn close to what you were writing with R code.

Kevin Ushey
  • 20,530
  • 5
  • 56
  • 88