13

It's best explained with an example.

I have a vector, or column from data.frame named vec:

vec <- c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA)

I would like a vectorized process (not a for loop) to change the three trailing NA when a 1 is observed.

The end vector would be:

c(NA, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA, NA, NA)

If we had:

vec <- c(NA, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA, NA, NA, NA)

The end vector would look like:

c(NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA)

A very badly written solution is:

vec2 <- vec
for(i in index(v)){
  if(!is.na(v[i])) vec2[i] <- 1
  if(i>3){
    if(!is.na(vec[i-1])) vec2[i] <- 1
    if(!is.na(vec[i-2])) vec2[i] <- 1
    if(!is.na(vec[i-3])) vec2[i] <- 1
  }
  if(i==3){
    if(!is.na(vec[i-1])) vec2[i] <- 1
    if(!is.na(vec[i-2])) vec2[i] <- 1
  }
  if(i==2){
    if(!is.na(vec[i-1])) vec2[i] <- 1
  }
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
dimitris_ps
  • 5,849
  • 3
  • 29
  • 55
  • 2
    Welp, there is no vectorized `seq` in R. The closest I could think of is using `Vectorize` as in `seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to")) ; indx <- which(!is.na(vec)) ; vec[seq2(indx, indx + 3)] <- 1`. Taken from [here](https://stackoverflow.com/a/15917618/3001626). But then again, `Vectorize` is just `mapply`. So you could just write a vecotrized `seq` using Rcpp probably – David Arenburg Jul 04 '17 at 14:27
  • Possible duplicate of [Fill NA in a time series only to a limited number](https://stackoverflow.com/questions/25940241/fill-na-in-a-time-series-only-to-a-limited-number) – raymkchow Jul 04 '17 at 14:42
  • 1
    @raymkchow thanks did not find that post. Tanks everyone for the solution, I excepted the one that run faster. – dimitris_ps Jul 04 '17 at 15:11
  • 4
    @dimitris_ps if the accepted answer is what you were looking for (not a vectorized solution) then this is exactly the same thing I've posted in my original comment and this is just a dupe. – David Arenburg Jul 04 '17 at 17:52

7 Answers7

19

Another option:

`[<-`(vec,c(outer(which(vec==1),1:3,"+")),1)
# [1] NA NA  1  1  1  1 NA  1  1  1  1 NA NA NA

Although the above works with the examples, it stretches the length of vec if a 1 is found in the last positions. Better to make a simple check and wrap into a function:

threeNAs<-function(vec) {
    ind<-c(outer(which(vec==1),1:3,"+"))
    ind<-ind[ind<=length(vec)]
    `[<-`(vec,ind,1)
}
Jaap
  • 81,064
  • 34
  • 182
  • 193
nicola
  • 24,005
  • 3
  • 35
  • 56
  • 1
    A minor alternative to this would be the following: `spots <- unique(c(outer(which(!is.na(vec)), 1:3, "+"))); vec[spots[spots <= length(vec)]] <-1`. This should account for duplicate indices as well as maintaining the original length. – lmo Jul 16 '17 at 20:30
14

Another fast solution:

vec[rep(which(vec == 1), each = 3) + c(1:3)] <- 1

which gives:

> vec
 [1] NA NA  1  1  1  1 NA  1  1  1  1 NA NA NA

Benchmarking is only really useful when done on larger datasets. A benchmark with a 10k larger vector and the several posted solutions:

library(microbenchmark)

microbenchmark(ans.jaap = {vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4); 
                           vec[rep(which(vec == 1), each = 3) + c(1:3)] <- 1},
               ans.989 = {vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
                          r <- which(vec==1);
                          vec[c(mapply(seq, r, r+3))] <- 1},
               ans.sotos = {vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
                            vec[unique(as.vector(t(sapply(which(vec == 1), function(i) seq(i+1, length.out = 3)))))] <- 1},
               ans.gregor = {vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
                             vec[is.na(vec)] <- 0;
                             n <- length(vec);
                             vec <- vec + c(0, vec[1:(n-1)]) + c(0, 0, vec[1:(n-2)]) + c(0, 0, 0, vec[1:(n-3)]);
                             vec[vec == 0] <- NA},
               ans.moody = {vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
                            output <- sapply(1:length(vec),function(i){any(!is.na(vec[max(0,i-3):i]))});
                            output[output] <- 1;
                            output[output==0] <- NA},
               ans.nicola = {vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
                             `[<-`(vec,c(outer(which(vec==1),1:3,"+")),1)})

which gives the following benchmark:

Unit: microseconds
       expr        min         lq       mean     median         uq        max neval   cld
   ans.jaap   1778.905   1937.414   3064.686   2100.595   2257.695  86233.593   100 a    
    ans.989  87688.166  89638.133  96992.231  90986.269  93326.393 182431.366   100   c  
  ans.sotos 125344.157 127968.113 132386.664 130117.438 132951.380 214460.174   100    d 
 ans.gregor   4036.642   5824.474  10861.373   6533.791   7654.587  87806.955   100  b   
  ans.moody 173146.810 178369.220 183698.670 180318.799 184000.062 264892.878   100     e
 ans.nicola    966.927   1390.486   1723.395   1604.037   1904.695   3310.203   100 a
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 6
    Thanks for including me in the benchmark and also for undeleting (not sure if you or @DavidArenburg: thank you David if so) my answer. I deleted it because the `which()` part was indeed common. The rest of the accepted answer's implementation was poor, and I posted a comment hoping for an edit which would have made my answer useless. – nicola Jul 05 '17 at 14:24
5

What really is 'vectorised', if not a loop written in a C-language?

Here's a C++ loop that benchmarks well.

vec <- c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA)

library(Rcpp)

cppFunction('NumericVector fixVec(NumericVector myVec){

    int n = myVec.size();
    int foundCount = 0;

    for(int i = 0; i < n; i++){
      if(myVec[i] == 1) foundCount = 1; 

      if(ISNA(myVec[i])){
        if(foundCount >= 1 & foundCount <= 3){
          myVec[i] = 1;
          foundCount++;
        }
      }
    }
    return myVec;
    }')

fixVec(vec)
# [1] NA NA  1  1  1  1 NA  1  1  1  1 NA NA NA

Benchmarks

library(microbenchmark)

microbenchmark(
      ans.jaap = {
        vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4); 
      vec[rep(which(vec == 1), each = 4) + c(0:3)] <- 1
},

    ans.nicola = {
        vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
      `[<-`(vec,c(outer(which(vec==1),0:3,"+")),1)
        },

    ans.symbolix = {
        vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4);
      vec <- fixVec(vec)
        }
)

# Unit: microseconds
# expr              min       lq      mean   median        uq       max neval
# ans.jaap     2017.789 2264.318 2905.2437 2579.315 3588.4850  4667.249   100
# ans.nicola   1242.002 1626.704 3839.4768 2095.311 3066.4795 81299.962   100
# ans.symbolix  504.577  533.426  838.5661  718.275  966.9245  2354.373   100


vec <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4)
vec <- fixVec(vec)

vec2 <- rep(c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA),1e4)
vec2[rep(which(vec2 == 1), each = 4) + c(0:3)] <- 1

identical(vec, vec2)
# [1] TRUE
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • 1
    The language in which the loop is written is not *that* crucial, the *main* problem is with evaluation an R functions in each iteration compared to doing everything in C. – David Arenburg Jul 05 '17 at 08:24
3

The following code does what you asked for. It involves "shifting" the vector and then adding the shifted versions

vec[is.na(vec)] <- 0                                 
n <- length(vec)                                     
vec <- vec + c(0, vec[1:(n-1)]) + c(0, 0, vec[1:(n-2)]) + c(0, 0, 0, vec[1:(n-3)])  
vec[vec == 0] <- NA                                    
vec[vec != 0] <- 1                                     

# vec                    |   0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 ,0, 0
# c(0, vec[1:(n-1)])     | + 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0 ,0, 0
# c(0, 0, vec[1:(n-2)])  | + 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0 ,0
# c(0,0,0,vec[1:(n-3)])  | + 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0 
#                        |-------------------------------------------
#                        |   0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0 
Gregor de Cillia
  • 7,397
  • 1
  • 26
  • 43
2

A non-Vectorized solution, but nevertheless, another option using base R,

vec[unique(as.vector(t(sapply(which(vec == 1), function(i) seq(i+1, length.out = 3)))))] <- 1
vec
#[1] NA NA  1  1  1  1 NA  1  1  1  1 NA NA NA

vec1[unique(as.vector(t(sapply(which(vec1 == 1), function(i) seq(i+1, length.out = 3)))))] <- 1
vec1
#[1] NA NA  1  1  1  1  1  1  1  1  1 NA NA NA
Sotos
  • 51,121
  • 6
  • 32
  • 66
0

With sapply, any, and is.na:

output <- sapply(1:length(vec),function(i){any(!is.na(vec[max(0,i-3):i]))})
output[output] <- 1
output[output==0] <- NA
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
0

How about this:

r <- which(vec==1)
vec[c(mapply(seq, r, r+3))] <- 1

Examples:

vec <- c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA)
#[1] NA NA  1  1  1  1 NA  1  1  1  1 NA NA NA

vec <- c(NA, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA, NA, NA, NA)
#[1] NA NA  1  1  1  1  1  1  1  1  1 NA NA NA
Jaap
  • 81,064
  • 34
  • 182
  • 193
989
  • 12,579
  • 5
  • 31
  • 53
  • It might be faster with a `unique` before the `c`. But it feels wrong to make it longer :) – moodymudskipper Jul 04 '17 at 15:09
  • 4
    This code will change the length of the vector in case there is a `1` on the end. For example `vec <- c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1)` – Gregor de Cillia Jul 04 '17 at 15:25
  • @GregordeCillia Good point but it can be easily handled like `x <- c(mapply(seq, r, r+3)); vec[x[x<=len]] <- 1` – 989 Jul 04 '17 at 15:32
  • 4
    How is that vectorized in a sense of not using a loop? – David Arenburg Jul 04 '17 at 17:46
  • @DavidArenburg In that sense, `outer(r, 0:3, '+')` can be in place of `mapply(seq, r, r+3)`. – 989 Jul 04 '17 at 17:54
  • 9
    I see you hide behind flags, so here goes again: `outer(r, 0:3, '+')` is no way similar to `mapply(seq, r, r+3)` because A: `outer` is fully vectorized while `mapply` is just a loop. B: Running `+` is both is much cheaper (computationally) than `seq` and a whole different concept in general. @nicolas idea was both better, met the vectoization requirements, and not even remotely similar to your solution. Also, please see the benchmarks and the ~X60 time speed differences. – David Arenburg Jul 04 '17 at 21:43
  • 9
    Don't include complaints about voting behaviour in your answer. – Jaap Jul 05 '17 at 16:58
  • 5
    @Jaap very correct,although I do understand his frustration. It's not vectorized in the strict sense of the word, but it's a valid answer that doesn't deserve that many downvotes (8 to be precise). And I can't understand why somebody took it as far as voting to delete this. That's just being mean, not helping SO forward whatsoever. – Joris Meys Jul 05 '17 at 19:50
  • 1
    @JorisMeys I understand it as well and as far as I'm concerned the answer can stay; it is a valid answer. – Jaap Jul 05 '17 at 20:02
  • 6
    @JorisMeys if the question asks for a "vectorized" (for which we might read fast, or efficient) solution and someone answers with a non-vectorized solution, is that a good (or to use SO terminology "useful") answer? If you think this is a useful answer you up-vote; those that don't find it useful down-vote. I don't see what "valid" has to do with this? – Gavin Simpson Jul 05 '17 at 21:06
  • 3
    @GavinSimpson now we can go into a semantic discussion about why `mapply` isn't a truly vectorized function but also "not a for loop" (as OP asked), but that's splitting hairs. As it isn't a great solution, it doesn't deserve an upvote. As it does do the job and not uses a for loop, it's not bad enough for a downvote for me either. So I just leave it alone and upvote the other answers. It would be literally the first time in 5 years I see a valid solution be downvoted to -5 at one point. That's what it has to do with this. YMMV – Joris Meys Jul 06 '17 at 10:00
  • @JorisMeys I'm surprised anything much in the R tag on [so] garners many downvotes, unless it is publicised in some way. Many of us bemoan the poor quality of a lot of answers here but we, me included, don't often do much about it. My point on this Q was more that "valid" isn't one of the criteria; at it's simplest it is "useful" vs "not useful", the latter conveying nothing about the validity of the answer. – Gavin Simpson Jul 06 '17 at 15:13