
I'm working with a large data set that I suspect has multicollinearity issues: the variance-covariance matrix has a negative eigenvalue (very small in magnitude compared to the rest), and the ratio of the largest to the smallest eigenvalue is greater than 3000.

My question is: is there a test routine in R to identify which variables are redundant? (I don't work with regression models.) I could draw pairwise regression plots or use the pairs(data) command, but I would really appreciate help with numerical tests: I have 200 variables, and graphs aren't good decision support at that scale.

Thomas
Maria D
  • Generally, to obtain useful feedback, you need to provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and show some of what you've already tried. – Thomas Jul 20 '13 at 13:07

1 Answer


If I understood correctly what you are looking for:

If you have a correlation threshold in mind for excluding variables, you could try the following.

In this example I generate a random data set and compute its correlation matrix:

> set.seed(3)
> data <- data.frame(V1=rnorm(20),V2=rnorm(20),V3=rnorm(20),V4=rnorm(20),V5=rnorm(20))
> cor.mat <- cor(data)
> diag(cor.mat)=0

This is the correlation matrix for the variables V1 to V5, with the diagonal set to 0 so that a variable's correlation with itself is not picked up:

> cor.mat
            V1          V2         V3         V4         V5
V1  0.00000000 -0.14464568 0.09047839 -0.1200863 -0.1110384
V2 -0.14464568  0.00000000 0.04340839  0.1929009 -0.4354569
V3  0.09047839  0.04340839 0.00000000  0.1185795  0.1760463
V4 -0.12008631  0.19290090 0.11857953  0.0000000 -0.2080077
V5 -0.11103839 -0.43545694 0.17604633 -0.2080077  0.0000000

Now, in the if statement of the following loop, substitute the threshold you want to use to flag redundant variables. Here I use 0.4: that does not indicate redundancy, but it is just below the largest absolute correlation that came out of this random matrix.

> High_cor <- character(0)
> for (i in 1:nrow(cor.mat)){
+     for (j in 1:ncol(cor.mat)){
+        # append, so that several matches in the same row are all kept
+        if (abs(cor.mat[i,j]) >= 0.4) {High_cor <- c(High_cor, paste(rownames(cor.mat)[i], "-",
+                                                         colnames(cor.mat)[j]))}
+     }
+ }

In this case the only pair with an absolute correlation of at least 0.4 is V2 and V5 (listed twice because the matrix is symmetric):

> High_cor
[1] "V2 - V5" "V5 - V2"
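
With 200 variables the double loop still works, but a vectorized version may be easier to read and reports each pair only once. The following is a minimal sketch in base R, assuming the same example data as above; the names thr, idx and high.pairs are only illustrative, and the threshold is something you would have to choose yourself.

```r
# Same example data as above (20 observations, 5 random variables)
set.seed(3)
data <- data.frame(V1 = rnorm(20), V2 = rnorm(20), V3 = rnorm(20),
                   V4 = rnorm(20), V5 = rnorm(20))
cor.mat <- cor(data)

thr <- 0.4  # assumed threshold; adjust to your own definition of "redundant"

# Look only above the diagonal, so each pair appears once and the
# diagonal (always 1) is ignored without having to overwrite it
idx <- which(abs(cor.mat) >= thr & upper.tri(cor.mat), arr.ind = TRUE)

# One row per flagged pair, together with its correlation
high.pairs <- data.frame(var1 = rownames(cor.mat)[idx[, "row"]],
                         var2 = colnames(cor.mat)[idx[, "col"]],
                         r    = cor.mat[idx],
                         stringsAsFactors = FALSE)
high.pairs
```

Note that this, like the loop, only looks at pairwise correlations. A variable that is a linear combination of several others can go undetected this way, so it may not account for everything the eigenvalues suggest.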

Hope this helps

Alice