1

I have a data.table in which I want to detect the presence of outliers (according to skewness and kurtosis) and, if found, correct them.

To this purpose, when an outlier is detected and var is the highest value, I want to set the highest value in var to be equal to the second highest. Below is a minimal (almost) working example of my code:

`%>%` <- fastpipe::`%>>%`

country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)

melted_data <- data.table(country, year, var)

melted_data %>%
  .[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
  .[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
  .[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%

  # Ranks
  .[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
  .[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year]  %>%

  # Identify and correct outliers
  .[rank_high_first==1, highest1 := var, by=year] %>%
  .[rank_high_first==2, highest2 := var, by=year] %>%
  .[rank_low_first==1, lowest1 := var, by=year] %>%
  .[rank_low_first==2, lowest2 := var, by=year] %>%
  .[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]

What I am trying to achieve is all in the last row. However, this does not work because the values highest1 and highest2 do not span the whole year group (edit: see also screenshot below). I think the solution would be to modify the following lines

.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%

so that highest1 and highest2 are copied to all rows in that year. How can I achieve that? I also tried the following, which did not work:

.[, highest1 := var[rank_high_first==1], by=year]

screenshot

Nimantha
  • 6,405
  • 6
  • 28
  • 69
gicanzo
  • 69
  • 8

1 Answers1

0

In the identify and correct outliers section, I believe you can make the following change:

f <- function(v,r,i) v[r==i & !is.na(r)]
melted_data[, `:=`(
  highest1 = f(var,rank_high_first,1),highest2=f(var,rank_high_first,2),
  lowest1 = f(var,rank_low_first,1),lowest2=f(var,rank_low_first,2)
),by=year]

Also, I was wondering about your initial definition of outliers. Should the parentheses be added as below?:

  .[, outliers := kurt>1 | (kurt>3.5 & abs(skew)>2), by=year]

The problem you are seeing in the last two lines, I believe can be solved by:

  .[outliers==TRUE & skew>0 & var==highest1, var:=highest2] %>%
  .[outliers==TRUE & skew<0 & var==lowest1, var:=lowest2]

Note: You do not need by=year here, and you should use := instead of <-

langtang
  • 22,248
  • 1
  • 12
  • 27
  • thanks! But I don't exactly understand the syntax in your first two rows. if `r` is `var`, `i` should be the highest value of `var` or... ? – gicanzo Apr 26 '22 at 13:39
  • oh and yes, there should be parentheses as you indicated (and the first part should be kurt>10, but just for the sake of having "outliers" in this little example..) – gicanzo Apr 26 '22 at 13:46
  • @gicanzo, sorry if my function, `f()` is a bit opaque. The `max` is probably a bit misleading. `v` is `var`, `r` is the ranking column, and `i` is the value of ranking column (say 1, or 2). When creating `highest1`, I'm sending var, rank_high_first, and 1 to the function. The function can be re-written/improved like this: `f <- function(v,r,i) v[r==i & !is.na(r)]`, although it is the same result. Both approaches remove any NAs, and return a single value. – langtang Apr 26 '22 at 14:11
  • I've updated the function, `f()` for improved clarity – langtang Apr 26 '22 at 14:12
  • Finally, I'm not sure why you are using `<-` in the final line. I've offered an alternative approach which should solve your issue – langtang Apr 26 '22 at 14:16