Remove highly correlated variables and keep the weakly correlated ones

Question

I have a dataset "rf1" with 845 features and 1052 rows, and I want to eliminate the highly correlated features before doing ML. I wrote this code, but it only shows me the features and their correlations without eliminating them...

`corr_simple <- function(rf1, sig = 0.9) {
  df_cor <- rf1 %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  corr <- cor(df_cor)
  corr[lower.tri(corr, diag = TRUE)] <- NA
  corr[corr == 1] <- NA
  corr <- as.data.frame(as.table(corr))
  corr <- na.omit(corr)
  corr <- subset(corr, abs(Freq) > sig)
  corr <- corr[order(-abs(corr$Freq)), ]
  print(corr)
  mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
}
corr_simple(rf1)`

This prints the correlated pairs, but I want to eliminate the variables whose correlation exceeds a threshold of 0.9.

When I use functions found here, like this one, I get an error message like this:

`data<-data.frame(rf1)
cor_matrix <- cor(data)
cor_matrix_rm <- cor_matrix                 
cor_matrix_rm[upper.tri(cor_matrix_rm)] <- 0
diag(cor_matrix_rm) <- 0
cor_matrix_rm
data_new <- data[ , !apply(cor_matrix_rm, 2, function(x) any(x > 0.90))]
Error in [.data.frame(data, , !apply(cor_matrix_rm, 2, function(x) any(x >  : 
  undefined columns selected`

I searched and tried other solutions, but I always run into this problem...

Can you make your post [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? — jrcalabrese, Feb 09 '23 at 18:21

score 1 · Accepted Answer · answered Feb 09 '23 at 18:28

You could do it with a loop. Here's an example using mtcars. You set the threshold in r_threshold (.8 in the example below). You loop over the columns of mtcars, each time removing the later columns whose absolute correlation with the current column is above the pre-defined threshold. After the relevant columns have been removed, it moves on to the next column, leaving the ones that have not been removed in previous steps. Notice that cyl, disp and wt have been removed (you can see this by the difference in the column names before and after the loop).

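data(mtcars)
colnames(mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

r_threshold <- .8
keep_going <- TRUE
i <- 1
while(keep_going){
  # columns to the right of the current column i
  s <- seq(i+1, ncol(mtcars))
  # correlation of each later column with column i
  r <- cor(mtcars[,s], mtcars[,i])
  if(any(abs(r) > r_threshold)){
    # drop the later columns above the threshold;
    # drop = FALSE keeps a data frame even if a single column is left
    mtcars <- mtcars[, -s[which(abs(r) > r_threshold)], drop = FALSE]
  }
  i <- i+1
  # stop once there are no columns left to the right of column i
  if(ncol(mtcars) <= i){
    keep_going <- FALSE
  }
}
colnames(mtcars)
#> [1] "mpg"  "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"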

Created on 2023-02-09 by the reprex package (v2.0.1)
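To apply the same idea to another data frame such as rf1 without overwriting it, the loop can be wrapped in a function. This is only a sketch: the name `drop_correlated` and its arguments are made up for illustration, it assumes every column is numeric, and it treats a correlation that cannot be computed (NA) as "not above the threshold".

```r
# Hypothetical helper (not part of any package): same greedy loop as above,
# returning a reduced copy instead of modifying the input in place.
drop_correlated <- function(df, threshold = 0.9,
                            use = "pairwise.complete.obs") {
  i <- 1
  while (i < ncol(df)) {
    later <- seq(i + 1, ncol(df))
    # one-column matrix: correlation of each later column with column i
    r <- cor(df[, later, drop = FALSE], df[[i]], use = use)
    # NA correlations (e.g. constant columns) are never flagged
    too_high <- !is.na(r[, 1]) & abs(r[, 1]) > threshold
    if (any(too_high)) {
      df <- df[, -later[too_high], drop = FALSE]
    }
    i <- i + 1
  }
  df
}

reduced <- drop_correlated(mtcars, threshold = 0.8)
colnames(reduced)
```

On the built-in mtcars data with a threshold of 0.8 this keeps the same eight columns as the loop above. Which member of a correlated pair survives depends on column order, since earlier columns are always kept.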

Thanks, but when I run the code I get another error: `Error in if (any(abs(r) > r_threshold)) { : missing value where TRUE/FALSE needed` — NDe, Feb 09 '23 at 19:09
Are any of the correlations in your correlation matrix missing? Use `use="pair"` as an argument to `cor()`; that computes each correlation from the pairwise-complete observations instead of returning NA whenever a value is missing, which should solve the problem. — DaveArmstrong, Feb 09 '23 at 19:21
It works, thank you Dave! Sorry for all the questions, I'm just a beginner learning R... — NDe, Feb 09 '23 at 19:25
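For anyone hitting the same error, here is a small illustration of why a missing value breaks the `if()` test and what the `use` argument changes. The vectors and the `flag` helper are toy examples invented for this sketch, not taken from rf1. NA correlations are also a likely cause of the `undefined columns selected` error in the question, because a logical column index that contains NA cannot be resolved.

```r
# Toy data (made up for illustration): x has one missing value.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)

r_default  <- cor(x, y)                                 # default use = "everything"
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")  # drops the incomplete pair

is.na(r_default)            # TRUE: the NA propagates into the correlation
any(abs(r_default) > 0.8)   # NA, which if() cannot evaluate
r_pairwise                  # computed from the four complete pairs

# A constant column has zero variance, so its correlation stays NA
# (with a warning) even under pairwise deletion.
r_const <- suppressWarnings(cor(rep(1, 5), y, use = "pairwise.complete.obs"))
is.na(r_const)

# Hypothetical guard: keeps NA out of the logical test.
flag <- function(r, threshold = 0.8) !is.na(r) & abs(r) > threshold
flag(c(r_default, r_pairwise, r_const))
```

So `use = "pair"` fixes NAs that come from missing data, but a constant column would still need the `!is.na()` guard or to be removed beforehand.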
