How to generate a grouping variable based on correlation?

Question

 library(magrittr)
 library(dplyr)
 V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
 V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
 cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)


 df <- data.frame(V1,V2,cor)

 # exclude rows where cor is NA
 df <- df[complete.cases(df), ]

This is the data frame after removing the NA rows; cor = NA stands for a correlation smaller than 0.8.

 df

   V1 V2 cor
1   A  A 1.0
2   A  B 0.8
7   B  A 0.8
8   B  B 1.0
15  C  C 1.0
16  C  D 0.8
21  D  C 0.8
22  D  D 1.0
29  E  E 1.0
30  E  F 0.9

In the above df, F does not appear in V1, meaning that F is not of interest.

So here I remove the rows where V2 is F (more generally, where V2 takes a value that is not in V1).

 V1.LIST <- unique(df$V1)
 df.gp <- df[which(df$V2 %in% V1.LIST),]

 df.gp

   V1 V2 cor
1   A  A 1.0
2   A  B 0.8
7   B  A 0.8
8   B  B 1.0
15  C  C 1.0
16  C  D 0.8
21  D  C 0.8
22  D  D 1.0
29  E  E 1.0

So now df.gp is the dataset I need to work on.

I drop the unused level in V2 (F in this example):

 # factor() keeps this working when V2 is a character column (the default since R 4.0.0)
 df.gp$V2 <- droplevels(factor(df.gp$V2))

I do not want to exclude the self-correlation rows (cor = 1), because some of the V1 values may not be correlated with any other variable, and I would like to put each of those in a separate group.

Looking at cor, A and B are correlated, C and D are correlated, and E forms a group by itself.

Therefore, this example should have three groups.

Does that makes sense given that you don't show any correlation coeff. for `A - C` , `B - C`, `A - D`, etc... ? — Sotos, Jul 29 '16 at 07:44
Don't you mean >.8, not >=.8, since otherwise they are all in the same group? And don't all variables necessarily correlate perfectly with themselves? — shayaa, Jul 29 '16 at 08:01
Actually I could show that, but the real dataset is download from a genetic variants website, and normally I am only interested in those pairs with a correlation coefficient>= 0.8. In the sample data frame, A and B are in the same group, C and D are in the same group. No correlation (>=0.8) between A-C, A-D, B-C, B-D. — cyrusjan, Jul 29 '16 at 08:20
You might try looking into `hclust`/`cutree` like [here](http://stackoverflow.com/questions/6518133/clustering-list-for-hclust-function) -- e.g. `cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df))), h = 0.8)` — alexis_laz, Jul 29 '16 at 09:19
I edited my answer to reflect my interpretation of your question — shayaa, Jul 30 '16 at 01:10

shayaa · Accepted Answer · 2016-07-30T05:28:40.600

The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.

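 library(reshape2)
 # cast the long data frame back to a wide V1 x V2 table of correlations
 wide <- dcast(data = df, formula = V1 ~ V2, value.var = "cor")
 # drop the V1 label column before as.matrix() so the matrix stays numeric
 cormat <- as.matrix(wide[, -1])
 row.names(cormat) <- as.character(wide$V1)
 cormat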

Once I had your correlation matrix, it was easy to see which column indices of non-NA values each variable shares with the other variables.

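 # for each row, the column indices of the non-NA correlations
 # (this gives a matrix only when every row has the same number of non-NA entries)
 a <- apply(cormat, 1, function(x) which(!is.na(x)))
 a <- data.frame(t(a))
 # move the variable names from the row names into a column
 a$var <- row.names(a)
 row.names(a) <- NULL
 a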

  X1 X2 var
1  1  2   A
2  1  2   B
3  3  4   C
4  3  4   D
5  5  6   E

Now either X1 or X2 determines your unique groupings.
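
To turn those indices into an explicit grouping variable, one option is to renumber X1 as consecutive integers. This is only a minimal sketch: it rebuilds the table above by hand, the column name `group` is an assumption for illustration, and it relies on every variable having the same number of non-NA entries, as in this example.

```r
# Rebuild the index table shown above
a <- data.frame(
  X1  = c(1, 1, 3, 3, 5),
  X2  = c(2, 2, 4, 4, 6),
  var = c("A", "B", "C", "D", "E")
)

# Variables that share the same first non-NA column index share a group;
# factor() maps the distinct indices 1, 3, 5 to consecutive ids 1, 2, 3
a$group <- as.integer(factor(a$X1))
a[, c("var", "group")]
```

Here A and B get group 1, C and D get group 2, and E gets group 3.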

Edited by cyrusjan:

The above script is a possible solution, assuming we have already selected the rows with cor >= a, where a is a threshold (0.8 in the question above).
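
Under the same assumption (only pairs with cor >= 0.8 remain), the grouping can also be read as finding connected components: two variables share a group whenever a chain of retained pairs links them. The base R sketch below is not part of the original answer; the names `edges`, `group` and `group.id` are hypothetical, and the data are the df.gp rows from the question.

```r
# Retained pairs (cor >= 0.8), copied from df.gp in the question
edges <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1),
  stringsAsFactors = FALSE
)

vars <- sort(unique(c(edges$V1, edges$V2)))
# Start with every variable in its own group
group <- setNames(seq_along(vars), vars)

# For each retained pair, relabel the whole group of one member
# with the label of the other, so linked groups are merged
for (i in seq_len(nrow(edges))) {
  g1 <- group[[edges$V1[i]]]
  g2 <- group[[edges$V2[i]]]
  if (g1 != g2) {
    group[group == max(g1, g2)] <- min(g1, g2)
  }
}

# Renumber the surviving labels as consecutive integers
group.id <- setNames(as.integer(factor(group)), vars)
group.id
```

Unlike the index table, this does not require each variable to have the same number of retained pairs, but it does chain groups together: if A-B and B-C are retained, A and C end up in one group even when the A-C pair is not.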

Contributed by alexis_laz:

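By using `hclust` and `cutree`, we can set the cut-off in the script (h = 0.8), as below. Note that h is applied to the distance 1 - cor rather than to cor itself, so h = 0.8 separates the groups here only because pairs with cor below 0.8 are already absent (xtabs fills them with 0, i.e. a distance of 1).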

 cor.gp <- data.frame(cor.gp =
      cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
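
For reference, here is a self-contained run of that snippet on the example data, followed by one possible way to attach the result back to the long data frame. The `group` column name is an assumption for illustration, and the data frame is retyped from the df.gp printed in the question.

```r
# df.gp as printed in the question (pairs with cor >= 0.8, F removed)
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1)
)

# xtabs() builds the V1 x V2 table of cor (absent pairs become 0),
# 1 - as.dist(...) turns it into distances, and cutree() cuts the tree at h
cor.gp <- data.frame(cor.gp =
  cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
cor.gp

# Look up each row's group by its V1 value (row names of cor.gp are the variables)
df.gp$group <- cor.gp[as.character(df.gp$V1), "cor.gp"]
df.gp
```

On this data A and B fall in group 1, C and D in group 2, and E in group 3. With a full correlation matrix (no pairs removed beforehand), a cor >= 0.8 cut-off would correspond to a height of about 1 - 0.8 = 0.2 on this distance scale.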

Thank you. But it seems that this does not work very well if I have actually three or more groups — cyrusjan, Jul 29 '16 at 09:00
Can you edit your question with a data example which minimally represents your own data and can you comment on what isn't working well. How do you know the groups a priori if you are assigning them based on cutting the numeric variable cor? — shayaa, Jul 29 '16 at 09:02
I have revised my question and example. Thank you for your suggestion. — cyrusjan, Jul 29 '16 at 09:53
