
So I have a data set with 48 observations and around 200 variables; the first column is my Date and the remaining 199 are my x variables. Before I run my regression, I would like to standardize them and remove outliers.

A simple version, so you get an idea:

data_final <- data.frame(
  Date1 = seq(as.Date('2017-01-01'), as.Date('2017-04-01'), by = 'months'),
  A = c(622, 512, 800, 729),
  B = c(1, 2, 1, 3),
  C = c(1, 0, 0, 0),
  D = c(NA, NA, 0.3, 0.2),
  E = c(300, 200, 100, 200))

So I can find the SD of each column by doing:

dataSD <- data.frame(label = "sd", t(apply(data_final[, -1], 2, sd, na.rm = TRUE)))

And also standardize to mean = 0, sd = 1 using scale():

data_scaled <- data.frame(Date = data_final$Date1, scale(data_final[2:ncol(data_final)]))

This all works. However, I want to find the outliers and any abnormal values in each of the 199 variables. More specifically, I want to see which columns have values that are 3 SD above their column mean.

Does anyone have a way or suggestions to list or subset out these variables?

I'm thinking about something like subsetting:

lapply(data_final[-1], function(x)
  x[abs(x - median(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE)])

But I'm not sure if it's the best way or if it works. I appreciate any inputs! Thanks in advance!

All the best, Michael


2 Answers


You could calculate the column means with colMeans and the column SDs with colSdColMeans, then build a two-row data frame holding the mean ± 3*SD bounds for each column.
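
Note that colSdColMeans is not a base R function; it computes column SDs from colMeans. A minimal definition (assuming the commonly shared helper of that name is the one meant):

colSdColMeans <- function(x, na.rm = TRUE) {
  # per-column count of non-NA values
  n <- if (na.rm) colSums(!is.na(x)) else nrow(x)
  # Var(X) = E[X^2] - E[X]^2, rescaled from population to sample variance
  colVar <- colMeans(x * x, na.rm = na.rm) - colMeans(x, na.rm = na.rm)^2
  sqrt(colVar * n / (n - 1))
}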

means <- colMeans(data_final[-1], na.rm = TRUE)
sds <- colSdColMeans(data_final[-1])
(xd <- as.data.frame(t(sapply(c(-1, 1), function(x) means + x * 3 * sds))))
#           A         B     C          D         E
# 1  287.9534 -1.122281 -1.25 0.03786797 -44.94897
# 2 1043.5466  4.622281  1.75 0.46213203 444.94897

Then, with mapply, check for each column whether any of its values fall outside that range.

mapply(function(x, y) any(sapply(x, function(z)
  !data.table::between(z, y[1], y[2])), na.rm = TRUE), data_final[-1], xd)
#     A     B     C     D     E 
# FALSE FALSE FALSE FALSE FALSE 
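
(All FALSE is expected on this toy data: with only four rows per column, no value can lie 3 sample SDs from its mean, since the largest attainable z-score is (n - 1)/sqrt(n) = 1.5 for n = 4. With your 48 rows, genuine outliers can be flagged.)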

Note that if there are columns consisting entirely of NA (or with a single non-NA value), we probably need to take special care with exception handling, since the SD is then NA.
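
A minimal sketch of one option, simply dropping such columns up front (data_checked is a name I'm introducing for illustration):

# keep only columns with at least two non-NA values (otherwise the SD is NA)
ok <- colSums(!is.na(data_final[-1])) >= 2
data_checked <- data_final[c(TRUE, ok)]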

jay.sf
    Great idea! And thanks for the heads up, I will subset the columns with NA out first! Thanks jay! – Michael Aug 17 '20 at 23:09

Here are a couple of options to create a logical index indicating whether or not each value is an outlier. I've taken an outlier to be more than 3 SDs above the mean, as you indicated (not also checking below the mean).

Using apply with MARGIN = 2, so the function is applied column by column:

library(dplyr)  # also provides the %>% pipe used below

numericaldata <- data_final[2:ncol(data_final)]

apply(numericaldata, MARGIN = 2, function(x) {
  x > 3 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE)
}) %>%
  cbind.data.frame(Date1 = data_final$Date1, .)

Using dplyr (across requires dplyr >= 1.0):

results <- data_final %>%
  mutate(across(-Date1, ~ .x > 3 * sd(.x, na.rm = TRUE) + mean(.x, na.rm = TRUE)))
results

Results:

       Date1     A     B     C     D     E
1 2017-01-01 FALSE FALSE FALSE    NA FALSE
2 2017-02-01 FALSE FALSE FALSE    NA FALSE
3 2017-03-01 FALSE FALSE FALSE FALSE FALSE
4 2017-04-01 FALSE FALSE FALSE FALSE FALSE
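
Note that the NAs in D simply propagate through the comparison; sum(..., na.rm = TRUE) below just ignores them.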

Should you want to know how many outliers were found in each column, you can summarise the stored results like so:

results %>%
  summarise(across(-Date1, ~ sum(.x, na.rm = TRUE)))
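
For the toy data above this gives zero for every column, since nothing exceeds its column's mean + 3 SD:

#   A B C D E
# 1 0 0 0 0 0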
Dealec
  • You are right, 3 sd above not below, sorry for the typo! This is wonderful, I love using dplyr, very clear and helpful! Thanks Dealec for your help! Much appreciated – Michael Aug 17 '20 at 23:11
  • Happy to help ! Yes, the new dplyr release (1.0) is pretty awesome. – Dealec Aug 18 '20 at 02:23