Remove consecutive duplicates from a vector, only if more than 5 consecutive

Question

Using R, I have the following vector:

x <- c(1,1,2,1,3,-99,-99,3,rep(-99,9),1,2,2,0,1,-99)
x
[1]   1   1   2   1   3 -99 -99 3 -99 -99 -99 -99 -99 -99 -99 -99 -99   1   2   2   0   1 -99

I want to remove consecutive values, but only if the consecutive values are more than a threshold, for example 5. So in this case the output result should be:

[1]   1   1   2   1   3 -99 -99 3  1   2   2   0   1 -99

I know I must use rle and maybe diff to do this, but I cannot figure out how to do this efficiently.

I do not think the suggested duplicate questions are actually duplicates, as in this case finding and removing only a subset of the values is the main part of the question. If this was not the case, using rle and duplicates would indeed be enough.

I've come up with this, but I'm sure there is a way better way, especially because this works only for the first instance of cuplicated values:

r <- rle(x)
toRemove <- which(r$lengths > 5)
startdupl <- sum(r$lengths[1:(toRemove-1)])+1
x[-(startdupl:(startdupl+r$lengths[toRemove]-1))]

The procedure instead should of course work for multiple duplicates of lengths >5.

Bonus points if I can replace the values with NA instead of removing them! Extra bonus points for using dplyr and/or making something faster than the functions below!

After a couple of good suggestions, here are some options I am considering and a small benchmark for a vector of 30000 elements:

f1 <- function(x) { inverse.rle(within.list(rle(x), values[lengths>5] <- NA))}
f2 <- function(x) {
  r <- rle(x)
  r$values[which(r$lengths>5)] <- NA
  with(r, rep(values, lengths))
}
f3 <- function(x) {as.vector(unlist(sapply(split(x, cumsum(c(1, 
diff(x) != 0))), function(i) replace(i, length(i) >= 5, NA))))}
f4 <- function(x) {do.call(c, sapply(split(x, cumsum(c(1, diff(x) != 0))), function(i) replace(i, length(i) >= 5, NA)))}

Results:

library(microbenchmark)
microbenchmark(f1(x), f2(x), f3(x), f4(x))
Unit: microseconds
  expr       min         lq       mean    median         uq       max neval
 f1(x)   559.445   602.3215   770.5779   652.395   660.6705  13108.82   100
 f2(x)   542.203   560.0705   882.0940   611.087   618.6395  14982.19   100
 f3(x) 50513.630 55523.6960 59338.0722 57408.724 60003.0870 145707.49   100
 f4(x) 52599.398 57648.0445 60722.3351 60098.227 62113.3655 124074.32   100

@akrun the main problem here is to find the consecutive values. Filtering for 5 or more is not really an issue. I was going to post `as.vector(unlist(sapply(split(x, cumsum(c(1, diff(x) != 0))), function(i) replace(i, length(i) >= 5, NA))))` but found the dupe so I didn't — Sotos, Jun 27 '17 at 11:32
@AF7 so you could not infer the answer of your question from the dupes? You can see my suggestion in comments which I did not post as I found it the same (minus the replace part) as 1 of the dupes. If you don't think so I will remove the dupes and add an answer — Sotos, Jun 27 '17 at 11:44
@Sotos, I could not, maybe because of a limitation of my intellect :) Anyhow I added benchmarks, they are quite interesting. — AF7, Jun 27 '17 at 11:48

akrun · Accepted Answer · 2017-06-27T10:05:23.500

2

We can create a logical index to subset both the values and lengths

with(rle(x), rep(values[lengths<=5], lengths[lengths<=5]))
#[1]   1   1   2   1   3 -99 -99   3   1   2   2   0   1 -99

If we want to replace the elements that have lengths greater than 5 to NA

 inverse.rle(within.list(rle(x), values[lengths>5] <- NA))
 #[1]   1   1   2   1   3 -99 -99   3  NA  NA  NA  NA  NA  NA  NA  NA  NA   1   2   2   0   1 -99

edited Jun 27 '17 at 10:05

answered Jun 27 '17 at 09:57

akrun

874,273
37
540
662

Awesome! What if instead of removing the data, I want to replace it with NA? I'm not used to using `with` so it's hard for me to grasp that's going on here. – AF7 Jun 27 '17 at 10:01
@AF7 Updated the post – akrun Jun 27 '17 at 10:05
1

thanks. I edited my question to add a microbenchmark. Your function and another function I devised (starting from yours) are more or less on par in speed. – AF7 Jun 27 '17 at 11:50

score 2 · Answer 2 · answered Jun 27 '17 at 11:51

2

Here is another way to do this,

do.call(c, lapply(split(x, cumsum(c(1, diff(x) != 0))), function(i) 
                                                        replace(i, length(i) >= 5, NA)))

# 11  12   2   3   4  51  52   6  71  72  73  74  75  76  77  78  79   8  91  92  10  11  12 
#  1   1   2   1   3 -99 -99   3  NA  NA  NA  NA  NA  NA  NA  NA  NA   1   2   2   0   1 -99

answered Jun 27 '17 at 11:51

Sotos

51,121
6
32
66

1

Thanks, I also added this to the possible functions in the benchmarks. – AF7 Jun 27 '17 at 11:54

Remove consecutive duplicates from a vector, only if more than 5 consecutive

After a couple of good suggestions, here are some options I am considering and a small benchmark for a vector of 30000 elements:

2 Answers2