-1

Example Data: 2 4 6 10 99 150 14 15 45

Prior knowledge of the problem, I know that anything above 35 is an outlier but because the data depending on time, I would like to replace all value above 35 with the last known numbers that are under 35. The dataset contains over millions of rows so I need to automatically do this instead of replacing one by one.

Result needed: 2 4 6 10 10 10 14 15 15

Kody Bui
  • 33
  • 3
  • http://stackoverflow.com/help/how-to-ask, http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – timat Mar 26 '18 at 14:06

2 Answers2

3
x <- c(2, 4, 6, 10, 99, 150, 14, 15, 45)

#set outliers to NA
x[x > 35] <- NA

#fill NA values with Last Observation Carried Forward
library(zoo)
x <- na.locf(x)
#[1]  2  4  6 10 10 10 14 15 15
Roland
  • 127,288
  • 10
  • 191
  • 288
0

For those that don't want a zoo package dependency, here's a simple version using run level encoding from base. The idea is simple, we just use the rle() and replace the NAs in the index with whatever is to the left (i.e. preceding value), and we skip any first NAs because there is no value to the left. Then we use inverse.rle() to get back to a full length vector. For reverse operation (, we just reverse the vector before and after. I haven't done benchmarking, but since all operations are vectorized, it should be quite fast.

rle() doesn't group NAs for some reason. Docs state "Missing values are regarded as unequal to the previous value, even if that is also missing.". Hence why I recode the NAs to a temporary string value, and have to convert the vector back to the right class. Not perfect, but works for most cases.

#' Last observation carried forward
#'
#' @param x A vector
#' @param reverse Whether to do it in reverse
#'
#' @return A vector
#' @export
#'
#' @examples
#' c(NA, 1, NA, 2, NA) %>% locf()
#' c(NA, 1, NA, 2, NA) %>% locf(reverse = T)
locf = function(x, reverse = F) {
  #reverse?
  if (reverse) x = rev(x)

  #recode NA
  #these are kept distinct by rle() by default for same reason ???
  x_class = class(x)
  x[is.na(x)] = "___tmp"

  #run level encoding
  x_rle = rle(x)

  #swap values for NAs
  which_na = which(x_rle$values == "___tmp")

  #skip 1st
  which_na = setdiff(which_na, 1)

  #replace values
  x_rle$values[which_na] = x_rle$values[which_na - 1]

  #back to normal
  y = inverse.rle(x_rle)

  #NA recode
  y[y == "___tmp"] = NA

  #fix type/class
  if (x_class[1] == "logical") y = as.logical(y)
  if (x_class[1] == "integer") y = as.integer(y)
  if (x_class[1] == "numeric") y = as.double(y)
  if (x_class[1] == "factor") y = factor(y, levels = levels(x))
  if (x_class[1] == "ordered") y = ordered(y, levels = levels(x))

  #reverse?
  if (reverse) y = rev(y)

  y
}

Tests:

> c(NA, 1, NA, 2, NA) %>% rle()
Run Length Encoding
  lengths: int [1:5] 1 1 1 1 1
  values : num [1:5] NA 1 NA 2 NA
> c(NA, 1, NA, 2, NA) %>% rle() %>% str()
List of 2
 $ lengths: int [1:5] 1 1 1 1 1
 $ values : num [1:5] NA 1 NA 2 NA
 - attr(*, "class")= chr "rle"
> #swap the values for one to left
> #reverse rle
> c(NA, 1, NA, 2, NA) %>% locf()
[1] NA  1  1  2  2
> c(NA, 1, NA, 2, NA) %>% locf(reverse = T)
[1]  1  1  2  2 NA
> c(NA, 1, NA, 2, NA, NA, NA) %>% locf()
[1] NA  1  1  2  2  2  2
CoderGuy123
  • 6,219
  • 5
  • 59
  • 89