How to find correlations in a dataset containing over 350 columns in R

Question

I have a dataset with ~360 measurement types as columns and 200 rows, each with a unique ID.

+-----+-------+--------+--------+---------+---------+---------+---+---------+
|     |  ID   |   M1   |   M2   |   M3    |   M4    |   M5    | … |  M360   |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| 1   | 6F0ZC | 0.068  | 0.0691 | 37.727  | 42.6139 | 41.7356 | … | 44.9293 |
| 2   | 6F0ZY | 0.0641 | 0.0661 | 37.2551 | 43.2009 | 40.8979 | … | 45.7524 |
| 3   | 6F106 | 0.0661 | 0.0676 | 36.9686 | 42.9519 | 41.262  | … | 45.7038 |
| 4   | 6F108 | 0.0685 | 0.069  | 38.3026 | 43.5699 | 42.3    | … | 46.1701 |
| 5   | 6F10A | 0.0657 | 0.0668 | 37.8442 | 43.2453 | 41.7191 | … | 45.7597 |
| 6   | 6F19W | 0.0682 | 0.071  | 38.6493 | 42.4611 | 42.2224 | … | 45.3165 |
| 7   | 6F1A0 | 0.0681 | 0.069  | 39.3956 | 44.2963 | 44.1344 | … | 46.5918 |
| 8   | 6F1A6 | 0.0662 | 0.0666 | 38.5942 | 42.6359 | 42.2369 | … | 45.4439 |
| .   | .     | .      | .      | .       | .       | .       | . | .       |
| .   | .     | .      | .      | .       | .       | .       | . | .       |
| .   | .     | .      | .      | .       | .       | .       | . | .       |
| 199 | 6F1AA | 0.0665 | 0.0672 | 40.438  | 44.9896 | 44.9409 | … | 47.5938 |
| 200 | 6F1AC | 0.0659 | 0.0681 | 39.528  | 44.606  | 43.2454 | … | 46.4338 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+

I am trying to find correlations among these measurements, identify the highly correlated features, and visualize them. With so many columns, the usual correlation plots (chart.Correlation, corrgram, etc.) are not usable.

I also tried qgraph, but the measurements end up cluttered in one place and the result is not very intuitive.

library(qgraph)
qgraph(cor(df[-c(1)], use="pairwise"), 
       layout="spring",
       label.cex=0.9,  
       minimum = 0.90,
       label.scale=FALSE)

Is there a good approach to visualize this and tell how these measurements are correlated with each other?

Look at this: http://stackoverflow.com/questions/31735920/facetting-in-ggplot2 — HubertL, Aug 31 '15 at 21:39
I am not sure how that solves my problem. I have over 360 columns; he had only 3 columns and hence used facetting. — Sharath, Aug 31 '15 at 21:49
Are you hoping for a visual-only solution? I'm skeptical you'll find that with so many columns. — Liz Young, Aug 31 '15 at 21:52
Not necessarily. If I am able to see just the numbers, that should be fine. All I want is to see how these measurements are correlated with each other, check for highly correlated ones, maybe group them, and do a lot more interesting stuff. — Sharath, Aug 31 '15 at 21:55
The question of how to handle situations where the number of predictors exceeds the number of cases comes up repeatedly in genetics, and there is an entire class of statistical research devoted to it. You should get some advice from someone with experience dealing with this issue. — IRTFM, Aug 31 '15 at 22:50

score 2 · Accepted Answer · answered Aug 31 '15 at 22:53

As mentioned in a comment, corrplot(...) might be a good option. Here is a ggplot option that does something similar. The basic idea is to draw a heat map, where color represents the correlation coefficient.

# create artificial dataset - you have this already
set.seed(1)   # for reproducible example
df <- matrix(rnorm(180*100),nrow=100)
df <- do.call(cbind,lapply(1:180,function(i)cbind(df[,i],2*df[,i])))

# you start here
library(ggplot2)
library(reshape2)
cor.df <- as.data.frame(cor(df))
cor.df$x <- factor(rownames(cor.df), levels=rownames(cor.df))
gg.df <- melt(cor.df,id="x",variable.name="y", value.name="cor")
# tiles colored continuously based on correlation coefficient
ggplot(gg.df, aes(x,y,fill=cor))+
  geom_tile()+
  scale_fill_gradientn(colours=rev(heat.colors(10)))+
  coord_fixed()

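# tiles colored by binned correlation coefficient (6 equal-width bins)
gg.df$level <- cut(gg.df$cor,breaks=6)
ggplot(gg.df, aes(x,y,fill=level))+
  geom_tile()+
  # one color per bin, so the palette cannot run short if every bin is populated
  scale_fill_manual(values=rev(heat.colors(nlevels(gg.df$level))))+
  coord_fixed()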

Note the diagonal. This is by design: the contrived data is set up so that each odd-numbered column and the column after it (1 and 2, 3 and 4, ...) are perfectly correlated.
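
Since the comments say that plain numbers and some grouping would also be fine, here is a rough base-R sketch of that, not part of the plots above. It lists every pair whose absolute correlation is at or above a cutoff, then groups columns by hierarchical clustering on 1 - |r|. The 0.9 cutoff (borrowed from the qgraph call in the question), the "M1".."M360" column names, the average linkage, and the use of 1 - |r| as a distance are all assumptions made for illustration; adjust them for the real data.

```r
# Stand-in data built the same way as above: 100 rows, 360 columns,
# where columns 2k-1 and 2k are perfectly correlated.
set.seed(1)
m  <- matrix(rnorm(180 * 100), nrow = 100)
df <- do.call(cbind, lapply(1:180, function(i) cbind(m[, i], 2 * m[, i])))
colnames(df) <- paste0("M", seq_len(ncol(df)))   # hypothetical column names

cm <- cor(df, use = "pairwise.complete.obs")

# 1. Numeric view: each pair with |r| >= threshold, listed once
#    (upper triangle only, so the diagonal and mirror pairs are skipped).
threshold <- 0.9                                 # assumed cutoff
idx  <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
high <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                   var2 = colnames(cm)[idx[, "col"]],
                   cor  = cm[idx])
high <- high[order(-abs(high$cor)), ]            # strongest pairs first
head(high)

# 2. Grouping: hierarchical clustering with 1 - |r| as the distance,
#    cutting the tree at the height that corresponds to the threshold.
hc     <- hclust(as.dist(1 - abs(cm)), method = "average")
groups <- cutree(hc, h = 1 - threshold)
table(table(groups))                             # number of groups of each size
```

With the contrived data this should recover only the constructed column pairs. On real measurements, `split(names(groups), groups)` gives the members of each group, and `hc$order` can be used to reorder the factor levels of x and y so that correlated columns sit next to each other in the heat map.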
