6

I have 22 variables, and I'd like to get the correlation scores, not as a matrix of correlation, but in a data frame, by pairs...

I mean... Not like this

    v1  v2  v3  v4
v1  1   x   x   x
v2  x   1   x   x
v3  x   x   1   x
v4  x   x   x   1

but like this:

var1  var2 cor
v1    v2   x
v1    v3   x
v1    v4   x
v2    v3   x
v2    v4   x
v3    v4   x

I'm new to R and I have been researching a lot, and I ended up with code that, frankly, is not efficient at all... My code creates a huge data frame with all the possible combinations for 22 variables (which is 4194304 combinations... a lot!!!)... And then the code assigns the correlations just for the first 211 rows, which are the combinations with at most 2 variables... Then I exclude everything I'm not interested in. Well... I get what I need. But I'm sure this is a very dumb way to do this and I'd like to learn a better way... Any tips?

My code:

#Getting the variable names from the data frame
av_variables<-variable.names(data.1)

#Creating a huge data frame for all possible combinations
corr_combinations <- as.data.frame(matrix(1,0,length(av_variables)))
for (i in 1:length(av_variables)){
  corr_combinations.i <- t(combn(av_variables,i))
  corr_combinations.new <- as.data.frame(matrix(1,length(corr_combinations.i[,1]),length(av_variables)))
  corr_combinations.new[,1:i] <- corr_combinations.i
  corr_combinations <- rbind(corr_combinations,corr_combinations.new)
}

#How many combinations for 0:2 variables?
comb_par_var<-choose(20, k=0:2)
##211

#A new column to receive the values
corr_combinations$cor <- 0


#Getting the correlations and assigning to the empty column
for (i in (length(av_variables)+1):(length(av_variables)+ sum(comb_par_var) +1)){
  print(i/length(corr_combinations[,1]))
  corr_combinations$cor[i] <- max(as.dist(abs(cor(data.1[,as.character(corr_combinations[i,which(corr_combinations[i,]!=0&corr_combinations[i,]!=1)])]))))
  # combinations$cor[i] <- max(as.dist(abs(cor(data.0[,as.character(combinations[i,combinations[i,]!=0&combinations[i,]!=1])]))))
}

#Keeping only the rows with the combinations of 2 variables
corr_combinations[1:(length(av_variables)+ sum(comb_par_var) +2),21]
corr_combinations<-corr_combinations[1:212,]
corr_combinations<-corr_combinations[21:210,]

#Keeping only the columns var1, var2 and cor
corr_combinations<-corr_combinations[,c(1,2,21)]

#Ordering to keep only the pairs with correlation >0.95, 
#which was my purpose the whole time
corr_combinations <- corr_combinations[order(corr_combinations$cor),]
corr_combinations<-corr_combinations[corr_combinations$cor >0.95, ] 
Thai
  • 1
    you can use `reshape2::melt` on the correlation matrix (set the `upper.tri` to `NA` before melting if you only want the lower corr matrix) . https://stackoverflow.com/questions/41793219/correlation-p-values-of-all-combinations-of-all-rows-of-two-matrices/41794556#41794556 gives a rough idea – user20650 Aug 22 '17 at 19:39
  • 1
    Thank you so much! That's exactly what I was looking for! I read a lot of questions about correlation and combinations here, but I hadn't found this specific one! – Thai Aug 22 '17 at 19:49

3 Answers

7

You can calculate the full correlation matrix in one go. Then you just need to reshape it. For example:

cr <- cor(mtcars)
# This is to remove redundancy as upper correlation matrix == lower 
cr[upper.tri(cr, diag=TRUE)] <- NA
reshape2::melt(cr, na.rm=TRUE, value.name="cor")
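If reshape2 is not available, roughly the same reshaping can be sketched in base R with `as.data.frame(as.table(...))`. The column names `var1`/`var2` and the 0.8 cutoff below are only illustrative assumptions (the question mentions 0.95 as the goal); this is a sketch, not necessarily equivalent to `melt` in every detail such as row order.

```r
# Full correlation matrix of the built-in mtcars data
cr <- cor(mtcars)

# Blank out the upper triangle and the diagonal so each pair appears once
cr[upper.tri(cr, diag = TRUE)] <- NA

# Matrix -> long data frame; as.table() keeps the row/column names
pairs <- as.data.frame(as.table(cr), responseName = "cor")
pairs <- pairs[!is.na(pairs$cor), ]
names(pairs)[1:2] <- c("var1", "var2")

# Keep only strongly correlated pairs (0.8 is an arbitrary, illustrative cutoff)
strong <- pairs[abs(pairs$cor) > 0.8, ]
strong[order(-abs(strong$cor)), ]
```

With 11 variables in `mtcars`, `pairs` has `choose(11, 2)` rows, one per unordered pair.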
user20650
4

One base R alternative is to use matrix subsetting on the row/column names that are pulled together with combn.

# get pairwise combination of variable names
vars <- t(combn(colnames(myMat), 2))

# build data.frame with matrix subsetting
data.frame(vars, myMat[vars])
  X1 X2 myMat.vars.
1 V1 V2   0.8500071
2 V1 V3  -0.2828288
3 V1 V4  -0.2867921
4 V2 V3  -0.2698210
5 V2 V4  -0.2273411
6 V3 V4   0.9962044

You can add column names in one line as well using setNames.

setNames(data.frame(vars, myMat[vars]), c("var1", "var2", "corr"))

data

set.seed(1234)
myMat <- cor(matrix(rnorm(16), 4, dimnames=list(paste0("V", 1:4), paste0("V", 1:4))))
myMat
           V1         V2         V3         V4
V1  1.0000000  0.8500071 -0.2828288 -0.2867921
V2  0.8500071  1.0000000 -0.2698210 -0.2273411
V3 -0.2828288 -0.2698210  1.0000000  0.9962044
V4 -0.2867921 -0.2273411  0.9962044  1.0000000
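The same idea can be wrapped into a small helper that starts from a data frame and optionally keeps only highly correlated pairs, which was the original goal of the question. The function name `pairwise_cor` and its `threshold` argument are hypothetical, introduced here only for illustration; it assumes all columns are numeric.

```r
# Hypothetical helper: pairwise correlations of a numeric data frame in long form
pairwise_cor <- function(df, threshold = 0) {
  cm <- cor(df)                          # full correlation matrix
  vars <- t(combn(colnames(cm), 2))      # one row per unordered pair of names
  out <- data.frame(var1 = vars[, 1],
                    var2 = vars[, 2],
                    cor  = cm[vars],     # matrix subsetting by (row name, column name)
                    stringsAsFactors = FALSE)
  out <- out[abs(out$cor) >= threshold, ]
  out[order(-abs(out$cor)), ]            # strongest correlations first
}

all_pairs <- pairwise_cor(mtcars)
high <- pairwise_cor(mtcars, threshold = 0.8)
```

Calling it with `threshold = 0.95` would mirror the filter in the question; `abs()` is used so that strong negative correlations are kept as well, which is a design choice you may want to drop.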
lmo
  • Ohhhh, I got it! I was using `mycor<-as.data.frame(combn(colnames(myMat), 2))`, but that would give a df with 2 observations of 2 hundred variables, and I couldn't transpose it! So this is how you do it! Thank you for your help, I learnt a lot! – Thai Aug 22 '17 at 20:14
  • 1
    Sure thing. Just note that `combn(colnames(myMat), 2)` creates a matrix here. Such an object is ideal for transposing with `t`. It is important to distinguish matrices from data.frames because they can behave differently. Matrices can be easily converted to data.frames with `data.frame` as above or with `as.data.frame`, but since we are also adding the correlation values here, we need to use `data.frame` to perform the coercion. – lmo Aug 22 '17 at 20:19
1

You can use tidyr to reshape the correlation matrix.

First, create a correlation matrix:

> d <- data.frame(x1=rnorm(10),
+                 x2=rnorm(10),
+                 x3=rnorm(10))
> x <- cor(d) # get correlations (returns matrix)
> x
           x1         x2         x3
x1  1.0000000  0.3096685 -0.5358578
x2  0.3096685  1.0000000 -0.7497212
x3 -0.5358578 -0.7497212  1.0000000

Then, use tidyr to reshape:

> y <- as.data.frame(x)
> y$var1 <- row.names(y)
> library(tidyr)
> gather(data = y, key = "var2", value = "correlation", -var1)
  var1 var2 correlation
1   x1   x1   1.0000000
2   x2   x1   0.3096685
3   x3   x1  -0.5358578
4   x1   x2   0.3096685
5   x2   x2   1.0000000
6   x3   x2  -0.7497212
7   x1   x3  -0.5358578
8   x2   x3  -0.7497212
9   x3   x3   1.0000000
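Note that the output above still contains the self-correlations and both orderings of every pair. A possible way to drop them is sketched below, using `pivot_longer`, which superseded `gather` in tidyr 1.0.0. The object names `long` and `pairs` and the seed are assumptions made for the example.

```r
library(tidyr)

set.seed(1)  # arbitrary seed, only to make the example reproducible
d <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

y <- as.data.frame(cor(d))
y$var1 <- row.names(y)

# Long format: one row per (var1, var2) combination, diagonal included
long <- pivot_longer(y, cols = -var1, names_to = "var2", values_to = "correlation")

# For two distinct names exactly one of a < b and b < a holds, so this
# keeps each unordered pair once and drops the diagonal
pairs <- long[long$var1 < long$var2, ]
pairs
```

For three variables this leaves three rows, one per pair.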