
While using the princomp() function in R, the following error is encountered: "covariance matrix is not non-negative definite".

I think this is due to some values being zero (actually close to zero, but they become zero during rounding) in the covariance matrix.

Is there a workaround to proceed with PCA when the covariance matrix contains zeros?

[FYI: obtaining the covariance matrix is an intermediate step within the princomp() call. A data file to reproduce this error can be downloaded from here - http://tinyurl.com/6rtxrc3]

384X21
  • Adding a sample input to make the problem reproducible is useful to answerers. – Richie Cotton Dec 19 '11 at 13:52
  • If you look at `stats:::princomp.default` you'll see that the error occurs when you have negative eigenvalues in the covariance matrix. – Richie Cotton Dec 19 '11 at 13:57
  • @Richie Cotton: I wish I could provide one. My data is huge (10K x 10K) and I haven't figured out which part is causing the error. I will be happy to know if there is a way in which I can extract the troubling part of the data and post it here! – 384X21 Dec 19 '11 at 14:01
  • `cv <- matrix(c(1, 2, 2, 1), nrow = 2); princomp(covmat = cv)` reproduces the error. Don't know how relevant it is to your dataset. – Richie Cotton Dec 19 '11 at 14:40
  • Thanks! I could reproduce it from a small portion of the original matrix, 1K x 1K (file size 5.5 MB). I was wondering how to post it. I am sure some elements of the covariance matrix will be zero (or close to it), as my input data has large chunks of identical values. – 384X21 Dec 19 '11 at 14:45
  • You can throw it up onto github or another free data hosting site. Google brings up many that claim to offer this service; I don't have any experience with them, though, so can't vouch one way or another. – Chase Dec 19 '11 at 14:52
  • @Chase: Thanks, I put up the file online. – 384X21 Dec 19 '11 at 15:06
  • @user1029725, it is not trivial to read the file using read.table. Do something like `m = matrix(runif(4), nrow = 2, ncol = 2)` and then `write.table(m, file = "bla")`. – Paul Hiemstra Dec 19 '11 at 15:12
  • Updated the input file. It is a symmetric matrix, so only the lower triangle is uploaded! – 384X21 Dec 19 '11 at 15:38
  • To transform a triangular matrix to a full matrix (as required for the `princomp()` function), I usually replace the `NA`s with `0`s first and then add the transpose of the matrix to the original matrix (something like this: `matrix <- matrix + t(matrix)`; see the sketch after these comments). – 384X21 Dec 19 '11 at 15:46
  • It sounds like your matrix might be singular - i.e., some column is a linear combination of the others. Zeroes in the covariance matrix are a good thing, but those zeroes come from columns of data being independent. If, as you say, your data has large sections of identical values, then your covariance matrix will have large entries, not small ones. You might need to remove columns that are redundant before you can use princomp. – Wesley Dec 19 '11 at 16:50
  • @Wesley: It is quite possible that some of the columns are linear combinations of others. But removing *redundant* columns means removing a variable from the data analysis, which is not desirable, I guess. Anyway, I will give it a try. Btw, how do I remove redundant columns? – 384X21 Dec 19 '11 at 16:54
  • @DWin has some good suggestions in his answer below. – Wesley Dec 19 '11 at 16:58
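
A minimal sketch of the triangular-to-full reconstruction described in the comments above (assuming a hypothetical matrix `lower` that holds the lower triangle with `NA`s above the diagonal; note that adding the plain transpose would double the diagonal, so it is subtracted back out once):

lower[is.na(lower)] <- 0                      # replace the NAs with zeros
full <- lower + t(lower) - diag(diag(lower))  # add the transpose; undo the doubled diagonal
isSymmetric(full)                             # should be TRUE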

1 Answer


The first strategy might be to decrease the tolerance argument. It looks to me that princomp() won't pass on a tolerance argument, but prcomp() does accept a 'tol' argument (a sketch follows the h9 example below). If that is not effective, this should identify which pairs of variables have nearly-zero covariance:

nr0 <- 0.001                              # threshold for "nearly zero"
which(abs(cov(M)) < nr0, arr.ind = TRUE)  # row/col indices of near-zero covariances

And this would identify any negative eigenvalues:

which(eigen(M)$values < 0)

Using the h9 example on the help(qr) page:

> which(abs(cov(h9)) < .001, arr.ind=TRUE)
      row col
 [1,]   9   4
 [2,]   8   5
 [3,]   9   5
 [4,]   7   6
 [5,]   8   6
 [6,]   9   6
 [7,]   6   7
 [8,]   7   7
 [9,]   8   7
[10,]   9   7
[11,]   5   8
[12,]   6   8
[13,]   7   8
[14,]   8   8
[15,]   9   8
[16,]   4   9
[17,]   5   9
[18,]   6   9
[19,]   7   9
[20,]   8   9
[21,]   9   9
> qr(h9[-9,-9])$rank  
[1] 7                  # rank deficient, at least at the default tolerance
> qr(h9[-(8:9),-(8:9)])$rank  # take out only the vectors with the most dependencies
[1] 6                         # still rank deficient
> qr(h9[-(7:9),-(7:9)])$rank
[1] 6
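
As a sketch of the first strategy above (made-up data, and assuming the raw data matrix is available rather than only a pre-computed covariance matrix): prcomp() works from the data via the SVD, and its 'tol' argument omits components whose standard deviation is below tol times that of the first component, so near-zero (or numerically negative) variances never enter the result.

set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
X <- cbind(X, X[, 1] + X[, 2])   # add a linearly dependent column
pc <- prcomp(X, tol = 0.01)      # components with near-zero variance are dropped
summary(pc)                      # only five components are reported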

Another approach might be to use the alias function:

alias( lm( rnorm(NROW(dfrm)) ~ dfrm) )
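
For example, a minimal sketch with a made-up matrix (the name `dat` and its columns are purely illustrative): regress a dummy response on the columns and let alias() report which columns are exact linear combinations of the others; those are the candidates to drop before calling princomp().

set.seed(1)
dat <- matrix(rnorm(50 * 4), ncol = 4)
dat <- cbind(dat, dat[, 1] - 2 * dat[, 2])   # column 5 is redundant
colnames(dat) <- paste0("v", 1:5)
alias(lm(rnorm(nrow(dat)) ~ dat))$Complete   # reports v5 as a combination of v1 and v2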
IRTFM