0

I have dataset with values from 100 to 200, but there are a few spikes in data. I don't want to smooth the whole dataset with rollmean or rollaplly.

I want to work it in that way:

  1. find these spikes with condition (value > 300)

  2. replace these too big values with mean/median that had been calculated from 10 near neighbors values.

Example in pseudo-code:

data[n] = spike

data[n] = mean(from data[n-5] to data[n+5])

It's like using window function not on the whole data set, only on certain points in data.

Thank you in advance

steveb
  • 5,382
  • 2
  • 27
  • 36
  • Please add a reproducible example. What do you mean 'from 10 near neighbors'? – Sotos Apr 04 '16 at 09:06
  • 1
    what happens if you have 2 'spikes' one after another? or If the `n+5` `n-5` have spikes?- Please have a look at [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Sotos Apr 04 '16 at 09:15
  • median with enough window size should solve this – Dauren Sergazin Apr 04 '16 at 09:16
  • Thank you for the link, I will add a reproducible example as soon as i get to my laptop with R and Data. – Dauren Sergazin Apr 04 '16 at 09:19

1 Answers1

0

I like this question. A typical moving average/k-nearest neighbourhood estimation. A nonnparametric approach. The following should work.

foo <- function(x, thresh = 300, h = 5, window.fun = mean) {
  spikes.loc <- which(x > thresh)
  low.bound <- spikes - h
  up.bound <- spikes + h
  N <- length(spikes.loc)
  x.hat <- x
  for (i in 1:N) x.hat[spikes.loc[i]] <- window.fun(x[low.bound[i]:up.bound[i]])
  return(x.hat)
  }

This function takes your original observations vector x, threshold, a window size (a smoothing parameter) as well as a user-specified window function. The returned value is the vector smoothed data. It only differs from original data at spikes points. Common choice of a window function is density function, so you end up with weighted average of all neighbouring data.

Please note, I am assuming your data are evenly spaced, so a simple index x[i-h] : x[i+h] gives a reasonable neighbourhood. In more general setting, a window is based on euclidean distance, but will naively costs O(N*N), where N is the number of observations, which is expensive.

In R, there are built-in nonparametric estimation / smoothing tools. The most basic one is kernel smoothing, a generalization of moving average. It uses FFT algorithm for fast computation at O(N log(N)) costs. Please see ?ksmooth. More advanced are KernSmooth and sm packages.

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248