I have a panel data frame similar to this:

df <- data.frame(
  year = c(2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2012L, 2013L, 2014L, 2015L,
           2016L, 2017L),
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  c = c(7.8, 8.1, 51, 8.2, 9, 10, 7.8, 8.1, 8.2, 0.1, 9.5, 10)
)

df
#    year id    c
# 1  2012  1  7.8
# 2  2013  1  8.1
# 3  2014  1 51.0
# 4  2015  1  8.2
# 5  2016  1  9.0
# 6  2017  1 10.0
# 7  2012  2  7.8
# 8  2013  2  8.1
# 9  2014  2  8.2
# 10 2015  2  0.1
# 11 2016  2  9.5
# 12 2017  2 10.0

I have typo errors in my data that show up as huge, obvious jumps or drops, like the 51 and 0.1 in column c. I would like to identify these errors within each group and correct them by taking the average of the value before and after the error. I would define such a jump or drop as any value that is more than four times, or less than a quarter of, the surrounding values.
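
Something along these lines is what I have in mind for the detection rule (just a rough sketch; flag_jump and the factor of 4 are my own naming, not from any package):

flag_jump <- function(x, factor = 4) {
  ## mean of the previous and the next value at each position
  neighbours <- (c(NA, head(x, -1)) + c(tail(x, -1), NA)) / 2
  ratio <- x / neighbours
  ## TRUE where a value is more than `factor` times, or less than 1/factor of, its neighbours
  !is.na(ratio) & (ratio > factor | ratio < 1/factor)
}

Applied per id on this example, it flags the 51 for id 1 and the 0.1 for id 2.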


1 Answer

Identify values lying more than 1.5 × IQR outside the quartiles as outliers using IQR(), set them to NA, and linearly interpolate them with approx(). ave() applies this group-wise.

outfun <- function(x) {
  ## lower and upper quartiles
  q <- quantile(x, c(1/4, 3/4))
  ## Tukey fences: 1.5 * IQR below Q1 and above Q3
  bounds <- c(q[1] - IQR(x) * 1.5, q[2] + IQR(x) * 1.5)
  out <- x < bounds[1] | x > bounds[2]
  x[out] <- NA  ## flag outliers as missing
  ## linear interpolation over the flagged positions
  approx(seq_along(x), x, seq_along(x))$y
}

df <- transform(df, c.new=ave(c, id, FUN=outfun))
df
#    year id    c c.new
# 1  2012  1  7.8  7.80
# 2  2013  1  8.1  8.10
# 3  2014  1 51.0  8.15
# 4  2015  1  8.2  8.20
# 5  2016  1  9.0  9.00
# 6  2017  1 10.0 10.00
# 7  2012  2  7.8  7.80
# 8  2013  2  8.1  8.10
# 9  2014  2  8.2  8.20
# 10 2015  2  0.1  8.85
# 11 2016  2  9.5  9.50
# 12 2017  2 10.0 10.00
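
To see why the 51 gets caught, this is roughly how the fences work out for id 1 (just a check by hand, not part of the function; x1 is a throwaway name):

x1 <- df$c[df$id == 1]
q1 <- quantile(x1, c(1/4, 3/4))  ## 8.125 and 9.75
q1 + c(-1.5, 1.5) * IQR(x1)      ## fences: about 5.69 and 12.19

So 51 lies above the upper fence, is set to NA, and the interpolation gives (8.1 + 8.2) / 2 = 8.15, as in c.new above.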

If there are missing values in the data, we could simply set na.rm=TRUE in quantile() and IQR(). However, to avoid these also being interpolated in the result (which might actually be a feature), we recover them by identifying them beforehand. Unfortunately, this doesn't seem to work with ave(), but we can use by().

outfun2 <- function(x) {
  na <- is.na(x)  ## remember the original missings
  q <- quantile(x, c(1/4, 3/4), na.rm=TRUE)
  bounds <- c(q[1] - IQR(x, na.rm=TRUE) * 1.5, q[2] + IQR(x, na.rm=TRUE) * 1.5)
  out <- x < bounds[1] | x > bounds[2]
  x[which(out)] <- NA  ## which() drops the NAs the comparison produces
  res <- approx(seq_along(x), x, seq_along(x))$y
  res[na] <- NA  ## restore the original missings
  res
}

df[8, 3] <- NA  ## introduce a missing value

df <- do.call(rbind, by(df, df$id, function(x) transform(x, c.new=outfun2(x$c))))
df
#      year id    c c.new
# 1.1  2012  1  7.8  7.80
# 1.2  2013  1  8.1  8.10
# 1.3  2014  1 51.0  8.15
# 1.4  2015  1  8.2  8.20
# 1.5  2016  1  9.0  9.00
# 1.6  2017  1 10.0 10.00
# 2.7  2012  2  7.8  7.80
# 2.8  2013  2   NA    NA
# 2.9  2014  2  8.2  8.20
# 2.10 2015  2  0.1  8.85
# 2.11 2016  2  9.5  9.50
# 2.12 2017  2 10.0 10.00
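
As a side note, rbind()-ing the by() pieces leaves composite row names like 1.1 and 2.10; if you prefer plain row numbers, you can reset them afterwards:

rownames(df) <- NULL  ## back to 1, 2, ..., 12
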
  • Thanks a lot for the response. In my actual dataset, where I have more than 80k ids, the length can differ for each id, so each one can start and end in a different year. When I try to apply your code I get the following error: Error in xy.coords(x, y, setLab = FALSE) : 'x' and 'y' lengths differ – Erwin Rhine Jan 08 '21 at 17:37
  • You're welcome @ErwinRhine, in this case would you mind providing a minimal data set that mimics your original data? Please follow the [guidelines for R](https://stackoverflow.com/a/5963610/6574038) of the community, cheers. – jay.sf Jan 08 '21 at 18:53