I have a data.table in which I want to detect the presence of outliers (according to skewness and kurtosis) and, if found, correct them.
To this purpose, when an outlier is detected and var
is the highest value, I want to set the highest value in var
to be equal to the second highest. Below is a minimal (almost) working example of my code:
`%>%` <- fastpipe::`%>>%`
country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)
melted_data <- data.table(country, year, var)
melted_data %>%
.[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
.[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
.[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%
# Ranks
.[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
.[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year] %>%
# Identify and correct outliers
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
.[rank_low_first==1, lowest1 := var, by=year] %>%
.[rank_low_first==2, lowest2 := var, by=year] %>%
.[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]
What I am trying to achieve is all in the last row. However, this does not work because the values highest1
and highest2
do not span the whole year
group (edit: see also screenshot below). I think the solution would be to modify the following lines
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
so that highest1
and highest2
are copied to all rows in that year. How can I achieve that? I also tried the following, which did not work:
.[, highest1 := var[rank_high_first==1], by=year]