Good Afternoon R wizards,
I searched through a few posts on replacing outliers in a data set. The two that came closest to answering my question were "Changing outliers for NA in all columns in a dataset in R" and "Replace outliers by quantiles in R".
The code in the second reference works great if you want to update a column or two, but I have 40+ columns and would like to use an apply function to hit all of them at once.
For each column, I want to set a threshold "max" equal to that column's quantile(probs = .75), and replace any value x > "max" with "max".
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
#apply(x,2,quantile,probs = .75)
Winsor75 <- function(x) {
  Max <- quantile(x, probs = .75)
  return(Max)
}
y <- as.data.frame(x)
y$a[y$a > Winsor75(x)] <- Winsor75(x)
The last line of code does replace the outliers in column a (in my case, values above the 75th percentile), but the threshold it uses is the 75th percentile of the entire matrix "x". Instead, I would like (a) the quantile to be computed separately for each column, and (b) to be able to use the function with apply/tapply etc. so I can perform the operation on all columns efficiently.
Any suggestions?
Thanks!
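
For reference, here is a minimal sketch of one way the per-column behaviour described above might be written. It uses the same example data as the question. The helper name cap_at_q75 and its argument p are hypothetical, and the sketch assumes every column is numeric and that missing values should be ignored when computing the quantile. It is an illustration under those assumptions, not a definitive solution.

```r
# Hypothetical helper: cap a numeric vector at its own p-th quantile.
cap_at_q75 <- function(v, p = 0.75) {
  # names = FALSE drops the "75%" label so the result is a plain number
  cap <- quantile(v, probs = p, na.rm = TRUE, names = FALSE)
  # pmin() keeps values at or below the cap and replaces larger ones with it
  pmin(v, cap)
}

# Same example data as in the question
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
x[2, 1] <- 100
x[4, 2] <- 200
colnames(x) <- c("a", "b")

# Matrix: apply() with MARGIN = 2 calls the helper once per column,
# so each column is capped at its own 75th percentile.
x_capped <- apply(x, 2, cap_at_q75)

# Data frame: lapply() over the columns; assigning into y[] keeps
# the result a data frame rather than a list.
y <- as.data.frame(x)
y[] <- lapply(y, cap_at_q75)
```

Because the helper takes a single vector and returns a vector of the same length, it should work with apply on a matrix and with lapply on a data frame. The threshold is computed inside the function, so each call only ever sees its own column.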