1

I have a matrix in which I would like to find those columns that are very similar (I am not looking to find identical columns).

# to generate a matrix
Mat <- matrix(rexp(400 * 1000, rate = .1), ncol = 1000, nrow = 400)

I personally thought of "cor" or "all.equal", and I tried the following, but it did not work.

indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))

What I need as output is an indication of which columns are highly similar and the degree of their similarity (this can be the correlation coefficient).

By similar I mean their values are close within some threshold (for example, over 75% of the elementwise residuals (e.g. column1 - column2) are less than 0.5 in absolute value).

I would also love to see how this differs from using correlation. Do the two approaches give identical results?
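
To make the criterion concrete, here is a tiny sketch of the check I have in mind (the helper name is just for illustration; 0.5 and 75% are the thresholds mentioned above):

# TRUE when more than 75% of the elementwise residuals are within 0.5
is_similar <- function(x, y) mean(abs(x - y) < 0.5) > 0.75
is_similar(Mat[, 1], Mat[, 2])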

  • Do you mean similar as in _correlated_ or similar as in the difference in their values is within some threshold? It would be helpful if you could elaborate a bit more on what you mean by similar. – Alex A. Mar 10 '15 at 14:19
  • Elementwise? Would you consider 1,2,3,4 and 1.1,2.1,3.1,4.1 similar? How about 1,2,3,4 and 4,1,2,3? – statespace Mar 10 '15 at 14:24
  • Thanks to @Alex and others. I don't think it makes much difference whether we find columns that are highly correlated or columns whose differences stay within some threshold. My main idea is to find those that are highly similar within a threshold, but I would love to see whether the results differ when we check for correlation instead. Anyway, I updated the question. –  Mar 10 '15 at 14:27
  • I might be thinking it wrong, but I see a simple linear regression (treat columns as time series), and the resulting summary output is exactly what you need. If order doesn't matter then sort them ascending beforehand. – statespace Mar 10 '15 at 14:28
  • 1
    I suggest you calculate the distance matrix. Start with `dist(t(Mat))` (see the sketch after these comments). – Roland Mar 10 '15 at 14:41
  • When you say 75% of column1-column2 are less than abs(0.5), do you mean the elementwise difference in columns? Like x[1]-y[1], x[2]-y[2], etc? – Alex A. Mar 10 '15 at 14:47
  • @Alex Yes, elementwise. E.g. when I say 75%, I mean that 75% of the elementwise residuals should be within the threshold. –  Mar 10 '15 at 14:50
  • That's helpful, thanks for clarifying. – Alex A. Mar 10 '15 at 14:50
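
Picking up the dist() suggestion from the comments, a minimal sketch (transposing makes dist() compare columns; a smaller distance means more similar columns):

D <- as.matrix(dist(t(Mat)))   # Euclidean distance between every pair of columns
diag(D) <- NA                  # ignore the zero self-distances
which(D == min(D, na.rm = TRUE), arr.ind = TRUE)   # the closest pair (reported twice, as (i,j) and (j,i))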

3 Answers

1

Using correlation, you could try the following (with a simpler matrix for demonstration):

set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)

corr <- cor(Mat)
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)
res <- res[res[, 1] != res[, 2], ]   # drop the diagonal entries

data.frame(res, correlation = corr[res])
  row col correlation
1   8   1   0.3387738
2   6   2   0.3350891

Both row and col actually refer to columns of your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738.
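
If you would rather avoid the matrixcalc dependency, here is a base R sketch of the same idea (same arbitrary 0.3 cutoff; lower.tri excludes the diagonal, so no filtering step is needed):

corr <- cor(Mat)
idx <- which(lower.tri(corr) & corr > .3, arr.ind = TRUE)   # pairs above the cutoff
data.frame(idx, correlation = corr[idx])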

DatamineR
  • why then does it say "row" and "col" if the row also refers to a column :-p –  Mar 10 '15 at 17:22
  • I mean it is a by-product which results from using `which` with `arr.ind = TRUE` – DatamineR Mar 10 '15 at 18:05
  • @Student can you please also add the correlation coefficient plot of the maximum correlated columns with the values of correlation in it for example http://stackoverflow.com/questions/15887212/heatmap-or-plot-for-a-correlation-matrix –  Mar 10 '15 at 18:11
  • I think you could try it yourself, it is described under the link you are naming (a minimal sketch follows below). – DatamineR Mar 10 '15 at 18:38
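
Regarding the plotting question in the comments: a minimal sketch, assuming the corrplot package (one of the options described under the linked question) is acceptable:

library(corrplot)
corrplot(corr, method = "color", addCoef.col = "black")   # heatmap with the coefficients printed in the cells
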
0

I'd take a linear regression approach:

Mat <- matrix(rexp(400 * 100, rate = .1), ncol = 100, nrow = 400)
combinations <- combn(1:ncol(Mat), m = 2)  # all pairs of column indices
sigma <- NULL
for (i in 1:ncol(combinations)) {
  # residual standard error of regressing one column of the pair on the other
  sigma <- c(sigma, summary(lm(Mat[, combinations[1, i]] ~ Mat[, combinations[2, i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))

The residual standard error then serves as the similarity criterion. You can order the data frame by sigma to get the best/worst combinations.
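
As a small follow-up sketch, you can attach the actual column numbers to each combination and then sort, with the smallest sigma meaning most similar under this criterion:

sigma$col1 <- combinations[1, ]
sigma$col2 <- combinations[2, ]
head(sigma[order(sigma$sigma), ])   # the most similar column pairs first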

statespace
0

If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:

set.seed(1)

Mat <- matrix(runif(40000), ncol=100, nrow=400)

col.combs <- t(combn(1:ncol(Mat), 2))

similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)

# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
    i <- col.combs[k, 1]
    j <- col.combs[k, 2]

    # Difference within threshold?
    diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)

    # Correlation between this pair of columns
    pair.corr <- cor(Mat[, i], Mat[, j])

    if (mean(diff.thresh) > 0.75)
        similar <- rbind(similar,
                         data.frame(Col1 = i, Col2 = j, Corr = pair.corr,
                                    Pct.Diff = 100 * mean(diff.thresh)))
}

In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.

> head(similar)
   Col1  Col2         Corr Pct.Diff
1     1     2 -0.003187894    76.75
2     1     3  0.074061019    76.75
3     1     4  0.082668387    78.00
4     1     5  0.001713751    75.50
5     1     8  0.052228907    75.75
6     1    12 -0.017921978    78.00

Perhaps it's not the best solution, but it gets the job done.

Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
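
A tiny illustration of that point:

v <- c(TRUE, FALSE, TRUE, TRUE)
sum(v)    # 3, the number of TRUE elements
mean(v)   # 0.75, the fraction of elements that are TRUE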

Alex A.