Run corrplot on a data frame by group

Question

I have a data frame with several columns of quantitative variables and one qualitative column that identifies the group.

The data frame has the same structure as this one:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

I would like to apply the corrplot function (from the corrplot package) to the data by group.

Could anybody help me out?

I tried to do what was suggested below by user20650 and this is the result:

This is the tail of my dataframe:

structure(list(group = structure(c(4L, 4L, 4L, 4L, 4L, 4L), .Label = c("brooksi", 
"copianullum", "fulbrighti", "paratrygonyi"), class = "factor"), 
    total_length = c(17, 25, 15, 9, 22, 25), max_w = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), n_prog = c(NA, NA, NA, NA, 482L, 432L), ceph_pedun_L = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), bothrid_L = c(NA, 870, NA, NA, NA, NA), bothrid_W = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), n_loculi = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), n_transv_septa = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), stalk_L = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), stalk_W = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), prog_max_W = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), term_seg_L = c(500L, 
    NA, 400L, 420L, NA, NA), term_seg_L.1 = c(360L, NA, 220L, 
    230L, NA, NA), ratio_term_seg = c(1.39, NA, 1.82, 1.83, NA, 
    NA), term_seg_SA = c(1800, NA, 880, 966, NA, NA), pore_pst_mrgn = c(360L, 
    NA, 260L, 300L, NA, NA), percent_.prog_L = c(72L, NA, 65L, 
    71L, NA, NA), n_progl_LgrW = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), n_mat_segs = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), n_testes = c(NA, 6L, 6L, 5L, NA, NA), testes_L = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), testes_W = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), length_tst_field = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), term_c_sac_L = c(150L, 
    NA, 105L, 125L, NA, NA), term_c_sac_W = c(125L, NA, 75L, 
    95L, NA, NA), ovary_L = c(255L, NA, 140L, 135L, NA, NA), 
    Ov_ratio_prog = c(51, NA, 35, 32.1, NA, NA), OV_max_W = c(240, 
    NA, 125, 140, NA, NA)), .Names = c("group", "total_length", 
"max_w", "n_prog", "ceph_pedun_L", "bothrid_L", "bothrid_W", 
"n_loculi", "n_transv_septa", "stalk_L", "stalk_W", "prog_max_W", 
"term_seg_L", "term_seg_L.1", "ratio_term_seg", "term_seg_SA", 
"pore_pst_mrgn", "percent_.prog_L", "n_progl_LgrW", "n_mat_segs", 
"n_testes", "testes_L", "testes_W", "length_tst_field", "term_c_sac_L", 
"term_c_sac_W", "ovary_L", "Ov_ratio_prog", "OV_max_W"), row.names = 563:568, class = "data.frame")

I tried to do what you said with this code:

for(i in unique(data$group)) {
    corrplot(cor(data[data$group==i, -match("group", names(data))]))
}

But I got this error:

Error in if (min(corr) < -1 - .Machine$double.eps || max(corr) > 1 + .Machine$double.eps) { : 
  missing value where TRUE/FALSE needed

You need to calculate the correlation between the quantitative variables for each grouping variable, and apply corrplot to each. It would be helpful if you could add some data and your attempt. please have a read of this http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example thanks — user20650, Nov 09 '15 at 18:23
To get you started: `par(mfrow=c(2,1)) ; for(i in unique(mtcars$am)) corrplot(cor(mtcars[mtcars$am==i, -match("am", names(mtcars))]))` — user20650, Nov 09 '15 at 18:24

score 1 · Accepted Answer · answered Nov 09 '15 at 18:59


Upgrading my comment to an answer:

You need to calculate the correlations between the quantitative variables separately for each level of the grouping variable, and then apply corrplot to each resulting correlation matrix.

Using the iris dataset

library(corrplot)

# one plotting panel per species
par(mfrow=c(3,1))

# loop through the levels of the grouping variable
for(i in unique(iris$Species)) {
    corrplot(cor(iris[iris$Species==i, -match("Species", names(iris))]))
}

The iris$Species==i subsets the rows of the data for each level of the grouping variable, and -match("Species", names(iris)) removes the grouping variable column, so that it is not included in the correlation calculation.
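
As an alternative to the explicit loop, the same idea can be sketched with split() and lapply(), which keeps one correlation matrix per group in a named list before anything is plotted. This is only a sketch on the built-in iris data. The names num_cols and cor_by_group are made up for illustration, and the plotting step is skipped when the corrplot package is not installed.

```r
# Numeric columns only, so the grouping factor is excluded from cor()
num_cols <- names(iris)[vapply(iris, is.numeric, logical(1))]

# One correlation matrix per level of Species, stored in a named list
cor_by_group <- lapply(split(iris[num_cols], iris$Species), cor)

# Plot each matrix in its own panel, titled with the group name
if (requireNamespace("corrplot", quietly = TRUE)) {
  old_par <- par(mfrow = c(1, length(cor_by_group)))
  for (g in names(cor_by_group)) {
    corrplot::corrplot(cor_by_group[[g]], title = g, mar = c(0, 0, 2, 0))
  }
  par(old_par)
}
```

Keeping the matrices in a list makes it possible to inspect them (for example with str(cor_by_group)) before plotting, which helps when a plot fails for only one of the groups.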


user20650


I edited my post to show what happened when I tried to do what you said. – uller Nov 09 '15 at 19:59
Okay, you need to account for the missing data. You do this within `cor`. See the `?cor` help page for the options. A sensible way to go is likely `use="pairwise"`. Of course, if you have **a lot** of missing values in your variables then you could still end up with problems. But then you need to think about what value a correlation estimate has when it is estimated from only a few observations. – user20650 Nov 09 '15 at 20:13
I was using rcorr from the Hmisc package to do the correlations:

m.data <- as.matrix(data)
# returns the correlation matrix. take a look at the str() of this object.
cormat <- rcorr(m.data, type="pearson")
# plot the correlation matrix
corrplot(as.matrix(cormat$r), type="upper", order="AOE",
         #p.mat = as.matrix(cormat$P), sig.level = 0.05, insig = "blank",
         method = "color", diag=FALSE, tl.col="black", tl.srt=45)

– uller Nov 09 '15 at 20:16
Tried the use="pairwise" and the same error appeared – uller Nov 09 '15 at 20:24
I am not familiar with `rcorr`, but it appears to default to pairwise removal of missing values (with no option to change this). The problems you are having are of course due to the missing values. You need to decide how best to deal with them, as the `corrplot` function will not work if missing values are present in the correlation matrix. So the first thing to do is generate a correlation matrix that contains no missing values. – user20650 Nov 09 '15 at 20:33
Things to try... you could remove the columns that are completely missing (`idx <- which(colSums(is.na(dat))!=nrow(dat)); newd <- dat[idx] ; cor(newd[-1], use="pairwise")`), but as you can see, for the tail of your data above, some missing values still remain. So you need to look at these variables and decide if it is worth including them, perhaps dropping a variable if its proportion of missing values is greater than a certain threshold. – user20650 Nov 09 '15 at 20:34
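
The suggestions in the comments above can be combined into one rough sketch: within each group, keep only the numeric columns that have enough observed values and some variation, compute pairwise correlations, and then drop variables until no NA is left in the matrix. The helper name clean_cor, the min_n threshold of 3, and the toy data frame dat are assumptions made for illustration and are not part of the original thread. Which variables to drop is a judgment call that depends on the data.

```r
# Hypothetical helper: correlation matrix for one group with no NA left in it
clean_cor <- function(d, min_n = 3) {
  # drop the grouping column and any other non-numeric columns
  d <- d[vapply(d, is.numeric, logical(1))]
  # keep columns with at least min_n observed values and non-zero spread
  ok <- vapply(d, function(x) {
    x <- x[!is.na(x)]
    length(x) >= min_n && isTRUE(sd(x) > 0)
  }, logical(1))
  d <- d[ok]
  if (ncol(d) < 2) return(NULL)
  m <- cor(d, use = "pairwise.complete.obs")
  # pairs with too few shared rows can still give NA:
  # repeatedly drop the variable involved in the most NA cells
  while (anyNA(m) && ncol(m) > 2) {
    worst <- which.max(rowSums(is.na(m)))
    m <- m[-worst, -worst, drop = FALSE]
  }
  if (anyNA(m)) NULL else m
}

# Toy data (made up): z is entirely missing in group "a", w is sparse in "b"
dat <- data.frame(
  group = rep(c("a", "b"), each = 6),
  x = c(1, 2, 3, 4, 5, 6, 2, 4, 6, 8, 10, 12),
  y = c(2, 1, 4, 3, 6, 5, NA, 1, 3, 2, 5, 4),
  z = c(NA, NA, NA, NA, NA, NA, 5, 3, 6, 1, 2, 4),
  w = c(3, NA, 1, NA, 2, NA, 1, NA, NA, NA, NA, NA)
)

# One cleaned matrix per group; groups with nothing usable are skipped
cors <- Filter(Negate(is.null), lapply(split(dat, dat$group), clean_cor))

if (requireNamespace("corrplot", quietly = TRUE)) {
  for (g in names(cors)) {
    corrplot::corrplot(cors[[g]], title = g, mar = c(0, 0, 2, 0))
  }
}
```

With this approach each group may end up with a different set of variables in its plot, so the panels are not directly comparable unless the same columns survive in every group.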
