Using R, I have the following vector:
x <- c(1,1,2,1,3,-99,-99,3,rep(-99,9),1,2,2,0,1,-99)
x
[1] 1 1 2 1 3 -99 -99 3 -99 -99 -99 -99 -99 -99 -99 -99 -99 1 2 2 0 1 -99
I want to remove runs of consecutive identical values, but only if the run is longer than a threshold, for example 5. So in this case the output should be:
[1] 1 1 2 1 3 -99 -99 3 1 2 2 0 1 -99
I know I must use rle and maybe diff to do this, but I cannot figure out how to do it efficiently.
I do not think the suggested duplicate questions are actually duplicates, because here finding and removing only a subset of the repeated values is the main part of the question. If that were not the case, using rle and duplicated would indeed be enough.
I've come up with this, but I'm sure there is a much better way, especially because it works only for the first run of duplicated values:
r <- rle(x)
toRemove <- which(r$lengths > 5)
startdupl <- sum(r$lengths[1:(toRemove-1)])+1
x[-(startdupl:(startdupl+r$lengths[toRemove]-1))]
The procedure should of course work when there are several runs of length > 5.
Bonus points if I can replace the values with NA instead of removing them!
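For reference, a minimal sketch of how the attempt above might be generalised to any number of long runs. The helper name in_long_run and its threshold argument are made up for illustration; the idea is to expand the per-run test back to one logical flag per element and then either drop or blank out the flagged positions:

```r
# Example vector from the question
x <- c(1, 1, 2, 1, 3, -99, -99, 3, rep(-99, 9), 1, 2, 2, 0, 1, -99)

# Hypothetical helper: TRUE for every element that belongs to a run
# longer than `threshold`, FALSE otherwise
in_long_run <- function(x, threshold = 5) {
  r <- rle(x)
  rep(r$lengths > threshold, r$lengths)
}

long <- in_long_run(x)

removed  <- x[!long]              # drop the long runs entirely
replaced <- replace(x, long, NA)  # or keep the positions and set them to NA

print(removed)
print(replaced)
```

Because the flag vector has the same length as x, the same mask serves both the removal and the NA replacement, and no start/end index arithmetic is needed.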
Extra bonus points for using dplyr and/or making something faster than the functions below!
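A possible dplyr version, as a sketch only: it assumes dplyr 1.1.0 or later (for consecutive_id()) and wraps the vector in a one-column tibble, which is my own choice and not part of the question. No claim is made about its speed relative to the base R functions.

```r
library(dplyr)

# Example vector from the question
x <- c(1, 1, 2, 1, 3, -99, -99, 3, rep(-99, 9), 1, 2, 2, 0, 1, -99)

out <- tibble(x = x) %>%
  mutate(run = consecutive_id(x)) %>%          # one id per run of equal values
  group_by(run) %>%
  mutate(x = if (n() > 5) NA_real_ else x) %>% # blank out runs longer than 5
  ungroup() %>%
  pull(x)

print(out)
```

To remove the long runs instead of replacing them, the grouped mutate() could be swapped for filter(n() <= 5).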
After a couple of good suggestions, here are some options I am considering and a small benchmark for a vector of 30000 elements:
# All four variants replace runs longer than 5 with NA
f1 <- function(x) inverse.rle(within.list(rle(x), values[lengths > 5] <- NA))
f2 <- function(x) {
  r <- rle(x)
  r$values[which(r$lengths > 5)] <- NA
  with(r, rep(values, lengths))
}
f3 <- function(x) {
  runs <- split(x, cumsum(c(1, diff(x) != 0)))  # one list element per run
  as.vector(unlist(sapply(runs, function(i) replace(i, length(i) > 5, NA))))
}
f4 <- function(x) {
  runs <- split(x, cumsum(c(1, diff(x) != 0)))  # one list element per run
  do.call(c, sapply(runs, function(i) replace(i, length(i) > 5, NA)))
}
Results:
library(microbenchmark)
microbenchmark(f1(x), f2(x), f3(x), f4(x))
Unit: microseconds
  expr       min         lq       mean    median         uq       max neval
 f1(x)   559.445   602.3215   770.5779   652.395   660.6705  13108.82   100
 f2(x)   542.203   560.0705   882.0940   611.087   618.6395  14982.19   100
 f3(x) 50513.630 55523.6960 59338.0722 57408.724 60003.0870 145707.49   100
 f4(x) 52599.398 57648.0445 60722.3351 60098.227 62113.3655 124074.32   100
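The 30000-element vector used for the timings is not shown above. The sketch below builds a stand-in input (the seed, the value set and the run lengths are all assumptions of mine, not the original data) and checks that the two rle-based variants return identical results, which is worth confirming before comparing their speed:

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical input: random run values with run lengths between 1 and 10,
# truncated to 30000 elements
vals  <- sample(c(0:3, -99), 10000, replace = TRUE)
lens  <- sample(1:10, 10000, replace = TRUE)
x_big <- rep(vals, lens)[1:30000]

# The two rle-based variants from the benchmark
f1 <- function(x) inverse.rle(within.list(rle(x), values[lengths > 5] <- NA))
f2 <- function(x) {
  r <- rle(x)
  r$values[which(r$lengths > 5)] <- NA
  with(r, rep(values, lengths))
}

same <- identical(f1(x_big), f2(x_big))
print(same)
```

With x_big in place of x, the microbenchmark call above can be rerun as is; the timings will of course differ from the ones reported here.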