I have a data set with 48 observations and about 200 variables. The first column is the date and the remaining 199 columns are my x variables. Before I run my regression, I would like to standardize them and remove outliers.
Here is a simplified version to give you an idea:
datafinal <- data.frame(
  Date1 = seq(as.Date('2017-01-01'), as.Date('2017-04-01'), by = 'months'),
  A = c(622, 512, 800, 729),
  B = c(1, 2, 1, 3),
  C = c(1, 0, 0, 0),
  D = c(NA, NA, 0.3, 0.2),
  E = c(300, 200, 100, 200))
I can find the SD of each column with:
dataSD <- data.frame(datafinal = "sD", t(apply(datafinal[, -1], 2, sd, na.rm = TRUE)))
I can also standardize each column to mean 0 and SD 1 using scale():
# stored as `scaled` so the result does not mask the scale() function
scaled <- data.frame(Date = datafinal$Date1, scale(datafinal[2:ncol(datafinal)]))
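As a quick sanity check on the standardization step, the sketch below rebuilds the toy data and confirms that each scaled column ends up with mean 0 and SD 1, with NAs ignored. The names `scaled`, `col_means` and `col_sds` are just illustrative choices.

```r
# Toy data from the question
datafinal <- data.frame(
  Date1 = seq(as.Date("2017-01-01"), as.Date("2017-04-01"), by = "months"),
  A = c(622, 512, 800, 729),
  B = c(1, 2, 1, 3),
  C = c(1, 0, 0, 0),
  D = c(NA, NA, 0.3, 0.2),
  E = c(300, 200, 100, 200)
)

# Standardize every column except the date
scaled <- data.frame(Date = datafinal$Date1, scale(datafinal[-1]))

# Per-column mean and SD of the standardized values, skipping NAs
col_means <- sapply(scaled[-1], mean, na.rm = TRUE)
col_sds   <- sapply(scaled[-1], sd, na.rm = TRUE)

# TRUE if every column is centred at 0 with unit SD (up to rounding error)
all(abs(col_means) < 1e-12) && all(abs(col_sds - 1) < 1e-12)
```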
This all works. However, I want to identify outliers and other abnormal values in each of the 199 variables. More specifically, I want to see which columns contain values more than 3 SD above their column mean.
Do you have any suggestions for how to list or subset these variables?
I'm thinking of something along these lines:
sapply(datafinal[2:ncol(datafinal)], function(x) abs(x - median(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE))
But I'm not sure whether it works or whether it's the best way. I appreciate any input! Thanks in advance!
All the best, Michael
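One possible approach, as a minimal sketch rather than a definitive answer: compute column-wise z-scores with scale() and use which(..., arr.ind = TRUE) to locate every value more than 3 SD from its column mean. The data below is a hypothetical stand-in with 48 rows (not the real data set), with one extreme value injected into column A purely for illustration. The names `x`, `z`, `hits`, `outliers` and `flagged_cols` are my own.

```r
# Hypothetical stand-in data: 48 monthly rows, three numeric columns
datafinal <- data.frame(
  Date1 = seq(as.Date("2014-01-01"), by = "month", length.out = 48),
  A = 600 + 50 * sin(1:48),
  B = cos(1:48),
  C = (1:48) %% 5
)
datafinal$A[10] <- 2000  # injected extreme value, for illustration only
datafinal$C[3]  <- NA    # a missing value, to show that NAs are tolerated

# Numeric part as a matrix, then column-wise z-scores
# (scale() ignores NAs when computing each column's mean and SD)
x <- as.matrix(datafinal[-1])
z <- scale(x)

# Row/column positions of values more than 3 SD from the column mean.
# Use z > 3 instead of abs(z) > 3 to flag only values above the mean.
hits <- which(abs(z) > 3, arr.ind = TRUE)

# One row per flagged value: when, which variable, raw value, z-score
outliers <- data.frame(
  Date     = datafinal$Date1[hits[, "row"]],
  variable = colnames(z)[hits[, "col"]],
  value    = x[hits],
  z        = z[hits],
  stringsAsFactors = FALSE
)

# Names of the columns that contain at least one flagged value
flagged_cols <- unique(outliers$variable)
```

In this stand-in data only the injected value in A is flagged. Two caveats: with n observations no z-score can exceed (n - 1) / sqrt(n) in absolute value, so the 4-row toy example in the question can never produce a 3 SD outlier (the limit is 1.5), whereas 48 rows can. Also, an extreme value inflates the SD it is measured against, so a median-based rule may be worth comparing if the mean/SD rule seems to miss obvious cases.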