
So I have a data set with 48 observations and around 200 variables; the first column is my Date and the remaining 199 are my x variables. Before I run my regression, I would like to standardize them and remove outliers.

A simple version, so you get an idea:

data_final <- data.frame(
  Date1 = seq(as.Date('2017-01-01'), as.Date('2017-04-01'), by = 'months'),
  A = c(622, 512, 800, 729),
  B = c(1, 2, 1, 3),
  C = c(1, 0, 0, 0),
  D = c(NA, NA, 0.3, 0.2),
  E = c(300, 200, 100, 200))

So I can find the SD of each column by doing:

dataSD <- data.frame(label = "sd", t(apply(data_final[, -1], 2, sd, na.rm = TRUE)))

And also standardize to mean = 0, sd = 1 using scale():

data_scaled <- data.frame(Date = data_final$Date1, scale(data_final[2:ncol(data_final)]))

This all works. However, I want to find the outliers and any abnormal values in each of the 199 variables. More specifically, I want to see which columns have values that are 3 SD above their column mean.

Does anyone have a way or suggestions to list or subset out these variables?

I'm thinking about something like subsetting:

lapply(data_final[-1], function(x)
  x[abs(x - median(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE)])

But I'm not sure if it's the best way or if it works. I appreciate any inputs! Thanks in advance!

All the best, Michael


2 Answers


You could calculate the column means with colMeans and the column SDs with colSdColMeans, then build a two-row data frame holding the mean ± 3*SD bounds for each column.
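
Note that colSdColMeans is not a base R function; it computes column SDs from colMeans. A minimal definition (assuming the commonly shared helper of that name is the one meant):

colSdColMeans <- function(x, na.rm = TRUE) {
  # per-column count of non-NA values
  n <- if (na.rm) colSums(!is.na(x)) else nrow(x)
  # Var(X) = E[X^2] - E[X]^2, rescaled from population to sample variance
  colVar <- colMeans(x * x, na.rm = na.rm) - colMeans(x, na.rm = na.rm)^2
  sqrt(colVar * n / (n - 1))
}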

means <- colMeans(data_final[-1], na.rm = TRUE)
sds <- colSdColMeans(data_final[-1])
(xd <- as.data.frame(t(sapply(c(-1, 1), function(x) means + x * 3 * sds))))
#           A         B     C          D         E
# 1  287.9534 -1.122281 -1.25 0.03786797 -44.94897
# 2 1043.5466  4.622281  1.75 0.46213203 444.94897

Then, with mapply, check for each column whether any of its values fall outside that range.

mapply(function(x, y) any(sapply(x, function(z)
  !data.table::between(z, y[1], y[2])), na.rm = TRUE), data_final[-1], xd)
#     A     B     C     D     E 
# FALSE FALSE FALSE FALSE FALSE 
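
(All FALSE is expected on this toy data: with only four rows per column, no value can lie 3 sample SDs from its mean, since the largest attainable z-score is (n - 1)/sqrt(n) = 1.5 for n = 4. With your 48 rows, genuine outliers can be flagged.)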

Note that if there are columns consisting entirely of NA (or with a single non-NA value), we probably need to take special care with exception handling, since the SD is then NA.
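
A minimal sketch of one option, simply dropping such columns up front (data_checked is a name I'm introducing for illustration):

# keep only columns with at least two non-NA values (otherwise the SD is NA)
ok <- colSums(!is.na(data_final[-1])) >= 2
data_checked <- data_final[c(TRUE, ok)]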

jay.sf
    Great idea! And thanks for the heads up, I will subset the columns with NA out first! Thanks jay! – Michael Aug 17 '20 at 23:09

Here are a couple of options to create a logical index indicating whether or not each value is an outlier. I've taken an outlier to be more than 3 SDs above the mean, as you indicated (not also checking below the mean).

Using apply with MARGIN = 2, so the function is applied column by column:

library(dplyr)  # also provides the %>% pipe used below

numericaldata <- data_final[2:ncol(data_final)]

apply(numericaldata, MARGIN = 2, function(x) {
  x > 3 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE)
}) %>%
  cbind.data.frame(Date1 = data_final$Date1, .)

Using dplyr (across requires dplyr >= 1.0):

results <- data_final %>%
  mutate(across(-Date1, ~ .x > 3 * sd(.x, na.rm = TRUE) + mean(.x, na.rm = TRUE)))
results

Results:

       Date1     A     B     C     D     E
1 2017-01-01 FALSE FALSE FALSE    NA FALSE
2 2017-02-01 FALSE FALSE FALSE    NA FALSE
3 2017-03-01 FALSE FALSE FALSE FALSE FALSE
4 2017-04-01 FALSE FALSE FALSE FALSE FALSE
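
Note that the NAs in D simply propagate through the comparison; sum(..., na.rm = TRUE) below just ignores them.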

Should you want to know how many outliers were found in each column, you can summarise the stored results like so:

results %>%
  summarise(across(-Date1, ~ sum(.x, na.rm = TRUE)))
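
For the toy data above this gives zero for every column, since nothing exceeds its column's mean + 3 SD:

#   A B C D E
# 1 0 0 0 0 0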
Dealec
  • You are right, 3 sd above not below, sorry for the typo! This is wonderful, I love using dplyr, very clear and helpful! Thanks Dealec for your help! Much appreciated – Michael Aug 17 '20 at 23:11
  • Happy to help ! Yes, the new dplyr release (1.0) is pretty awesome. – Dealec Aug 18 '20 at 02:23