
I have a vector like the following:

xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)

I want to find the indexes that have ones and combine them together. In this case, I want the output to look like 1 6 and 11 14 in a 2x2 matrix. My vector is actually very long so I can't do this by hand. Can anyone help me with this? Thanks.

Arun
user1938809

4 Answers


Something like this, maybe?

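# cumsum of the run lengths gives the last index of every run
if (xx[1] == 1) {
    # prepend 0 so the boundaries pair up as (end of 0-run, end of 1-run)
    rr <- cumsum(c(0, rle(xx)$lengths))
} else {
    rr <- cumsum(rle(xx)$lengths)
}
# an odd number of boundaries means a trailing run of 0s; drop its end
if (length(rr) %% 2 == 1) {
    rr <- head(rr, -1)
}
oo <- matrix(rr, ncol=2, byrow=TRUE)
# the first column holds the end of the preceding 0-run; add 1 to get the start
oo[, 1] <- oo[, 1] + 1
oo
#      [,1] [,2]
# [1,]    1    6
# [2,]   11   14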

This edit takes care of two cases: 1) the vector starts with a 0 rather than a 1, and 2) the vector ends with a run of 0's, which leaves an odd number of run boundaries. For example: xx <- c(1,1,1,1,1,1,0,0,0,0).
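The same idea can be written without the two special cases by computing the start and end of every run and then keeping only the runs whose value is 1. This is a minimal sketch, not the code from the answer above; the function name one_runs is hypothetical and it assumes xx contains only 0's and 1's.

```r
# Hypothetical helper (not part of the original answer):
# start and end index of every run of 1s in a 0/1 vector.
one_runs <- function(xx) {
    r <- rle(xx)
    ends   <- cumsum(r$lengths)        # last index of every run
    starts <- ends - r$lengths + 1     # first index of every run
    keep   <- r$values == 1            # keep only the runs of 1s
    cbind(start = starts[keep], end = ends[keep])
}

one_runs(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1))  # rows (1, 6) and (11, 14)
one_runs(c(0, 0, 1, 1, 0))                              # single row (3, 4)
```

Because the runs are filtered by value rather than by position, it does not matter whether the vector starts or ends with a 0.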

Arun

Since the question originally had the tag 'bioinformatics', I'll mention the Bioconductor package IRanges (and its companion for ranges on genomes, GenomicRanges)

> library(IRanges)
> xx <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1)
> sl = slice(Rle(xx), 1)
> sl
Views on a 14-length Rle subject

views:
    start end width
[1]     1   6     6 [1 1 1 1 1 1]
[2]    11  14     4 [1 1 1 1]

which could be coerced to a matrix, but that would often not be convenient for whatever the next step is

> matrix(c(start(sl), end(sl)), ncol=2)
     [,1] [,2]
[1,]    1    6
[2,]   11   14

Other operations might start on the Rle, e.g.,

> xx = c(2,2,2,3,3,3,0,0,0,0,4,4,1,1)
> r = Rle(xx)
> m = cbind(start(r), end(r))[runValue(r) != 0,,drop=FALSE]
> m
     [,1] [,2]
[1,]    1    3
[2,]    4    6
[3,]   11   12
[4,]   13   14

See the help page ?Rle for the full flexibility of the Rle class; to go from a matrix like that above to a new Rle as asked in the comment below, one might create a new Rle of appropriate length and then subset-assign using an IRanges as index

> r = Rle(0L, max(m))
> r[IRanges(m[,1], m[,2])] = 1L
> r
integer-Rle of length 14 with 3 runs
  Lengths: 6 4 4
  Values : 1 0 1

One could expand this to a full vector

> as(r, "integer")
 [1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1

but often it's better to continue the analysis on the Rle. The class is very flexible, so one way of going from xx to an integer vector of 1's and 0's is

> as(Rle(xx) > 0, "integer")
 [1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1

Again, though, it often makes sense to stay in Rle space. And Arun's answer to your separate question is probably best of all.

Performance (speed) is important, although in this case I think the Rle class provides a lot of flexibility that would weigh against poor performance, and ending up at a matrix is an unlikely end-point for a typical analysis. Nonetheless, the IRanges infrastructure is performant

eddi <- function(xx)
    matrix(which(diff(c(0,xx,0)) != 0) - c(0,1),
           ncol = 2, byrow = TRUE)

iranges = function(xx) {
    sl = slice(Rle(xx), 1)
    matrix(c(start(sl), end(sl)), ncol=2)
}

iranges.1 = function(xx) {
    r = Rle(xx)
    cbind(start(r), end(r))[runValue(r) != 0, , drop=FALSE]
}

with

> xx = sample(c(0, 1), 1e5, TRUE)
> microbenchmark(eddi(xx), iranges(xx), iranges.1(xx), times=10)
Unit: milliseconds
          expr       min        lq    median        uq      max neval
      eddi(xx)  45.88009  46.69360  47.67374 226.15084 234.8138    10
   iranges(xx) 112.09530 114.36889 229.90911 292.84153 294.7348    10
 iranges.1(xx)  31.64954  31.72658  33.26242  35.52092 226.7817    10
Martin Morgan

Another, short one:

cbind(start = which(diff(c(0, xx)) == +1),
      end   = which(diff(c(xx, 0)) == -1))
#      start end
# [1,]     1   6
# [2,]    11  14

I tested this on a very long vector and it is marginally slower than using rle, but more readable IMHO. If speed were really a concern, you could compute the diff only once:

xx.diff <- diff(c(0, xx, 0))
cbind(start = which(head(xx.diff, -1) == +1),
      end   = which(tail(xx.diff, -1) == -1))
#      start end
# [1,]     1   6
# [2,]    11  14
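To see why this works, it may help to look at the intermediate diff vector for the example from the question. The sketch below only unpacks the code above; the names starts and ends are introduced here for illustration.

```r
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)

# Pad with a 0 on both sides so that runs touching either end of the
# vector still produce a step up (+1) or a step down (-1).
xx.diff <- diff(c(0, xx, 0))
xx.diff
# values: 1 0 0 0 0 0 -1 0 0 0 1 0 0 0 -1

# Dropping the last element aligns element i with xx[i] - xx[i - 1],
# so a +1 marks the first index of a run of 1s.
starts <- which(head(xx.diff, -1) == +1)   # 1 11

# Dropping the first element aligns element i with xx[i + 1] - xx[i],
# so a -1 marks the last index of a run of 1s.
ends <- which(tail(xx.diff, -1) == -1)     # 6 14
```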
flodel

Here's another solution that's built upon the others' ideas, and is a bit shorter and faster:

matrix(which(diff(c(0,xx,0)) != 0) - c(0,1), ncol = 2, byrow = TRUE)
#     [,1] [,2]
#[1,]    1    6
#[2,]   11   14

I didn't test the non-base solution, but here's a comparison of base ones:

xx = sample(c(0,1), 1e5, TRUE)
microbenchmark(arun(xx), flodel(xx), flodel.fast(xx), eddi(xx))
#Unit: milliseconds
#            expr       min        lq    median        uq       max neval
#        arun(xx) 14.021134 14.181134 14.246415 14.332655 15.220496   100
#      flodel(xx) 12.885134 13.186254 13.248334 13.432974 14.367695   100
# flodel.fast(xx)  9.704010  9.952810 10.063691 10.211371 11.108171   100
#        eddi(xx)  7.029448  7.276008  7.328968  7.439528  8.361609   100
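The benchmark calls arun, flodel, flodel.fast and eddi, which are not defined in the thread. Presumably they are thin wrappers around the corresponding answers; the definitions below are an assumption along those lines, together with a quick check that all four return the same ranges on a random 0/1 vector.

```r
# Assumed wrappers around the base-R answers in this thread.
arun <- function(xx) {
    if (xx[1] == 1) {
        rr <- cumsum(c(0, rle(xx)$lengths))
    } else {
        rr <- cumsum(rle(xx)$lengths)
    }
    if (length(rr) %% 2 == 1) rr <- head(rr, -1)
    oo <- matrix(rr, ncol = 2, byrow = TRUE)
    oo[, 1] <- oo[, 1] + 1
    oo
}

flodel <- function(xx)
    cbind(start = which(diff(c(0, xx)) == +1),
          end   = which(diff(c(xx, 0)) == -1))

flodel.fast <- function(xx) {
    xx.diff <- diff(c(0, xx, 0))
    cbind(start = which(head(xx.diff, -1) == +1),
          end   = which(tail(xx.diff, -1) == -1))
}

eddi <- function(xx)
    matrix(which(diff(c(0, xx, 0)) != 0) - c(0, 1), ncol = 2, byrow = TRUE)

# Sanity check: all four agree once the column names are dropped.
set.seed(1)
xx <- sample(c(0, 1), 1e3, TRUE)
ref <- eddi(xx)
stopifnot(isTRUE(all.equal(ref, arun(xx))),
          isTRUE(all.equal(ref, unname(flodel(xx)))),
          isTRUE(all.equal(ref, unname(flodel.fast(xx)))))
```

Note that all four assume xx contains only 0's and 1's; for other non-zero values one could pass as.integer(xx != 0) instead.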
eddi