Correlation of subsets of dataframe using aggregate

Question

I have a data frame made by row binding many data frames, each identified with a unique key. I wish to calculate the correlation coefficients for columns in each subset (using the unique key) of the big data frame. For example, using the mtcars data I might want to calculate the correlation between columns hp and wt for each unique value in column cyl. I could do it in a loop:

data("mtcars")
for (i in c(4, 6, 8)) {
  temp <- subset(mtcars, cyl == i)
  # print() is needed: values are not auto-printed inside a for loop
  print(cor(temp$hp, temp$wt))
}

I think aggregate would be better, but this code doesn't work:

data("mtcars")
aggregate(mtcars,by=mtcars$cyl,cor)

cryo111 · Accepted Answer · score 9 · 2021-09-06T18:06:02.187

You could use

 data("mtcars")
 library(plyr)
 ddply(mtcars, "cyl", function(x) cor(x$hp, x$wt))

This splits the data in mtcars by cyl, applies the function cor(x$hp, x$wt) to each subset x, and then combines the per-subset results into a single data.frame.
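To make the three steps concrete, here is a minimal base R sketch of the same split, apply, and combine sequence. It is not how plyr is implemented internally; the names pieces, cors and result are just illustrative, and the V1 column name is chosen only to mirror what ddply produces for an unnamed result.

```r
data("mtcars")

# Split: one data frame per distinct value of cyl
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute the hp/wt correlation within each piece
cors <- vapply(pieces, function(x) cor(x$hp, x$wt), numeric(1))

# Combine: collect the per-group results into a single data.frame
result <- data.frame(cyl = as.numeric(names(cors)), V1 = unname(cors))
result
```

The result should have one row per cyl value, with the correlation in column V1.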

I can highly recommend the plyr package. It's one of the packages I use most in R.

Edit: As requested, here is a dplyr version. I have to say that I am not a big dplyr user, but the code should be OK.

library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp, wt))

edited Sep 06 '21 at 18:06

answered Apr 24 '13 at 01:20


so there is no way of dealing with this using aggregate? – Alex Apr 24 '13 at 01:39
http://stackoverflow.com/questions/14176756/difference-between-ddply-and-aggregate answers my question. I have accepted your answer, thanks again. – Alex Apr 24 '13 at 01:44
@Alex - No, it's not a job for `aggregate`, but for `split`. See my answer below. This is the tag-line of the `plyr` package: `split`-`apply`-`combine` – CHP Apr 24 '13 at 02:14
Works well. @cryo111 could you edit your answer with a `dplyr` version? – Alex Trueman May 19 '15 at 15:23
@Alex Have added a `dplyr` solution. – cryo111 May 19 '15 at 15:54

score 9 · Answer 2 · answered Apr 24 '13 at 02:12

In base R, it's a job for split and lapply or sapply:

lapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
## $`4`
## [1] 0.1598761
## 
## $`6`
## [1] -0.3062284
## 
## $`8`
## [1] 0.01761795
## 


sapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
##           4           6           8 
##  0.15987614 -0.30622844  0.01761795


CHP


`by` = `split` + `lapply`. – thelatemail Dec 17 '15 at 02:21
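
As a rough illustration of that last comment, the same per-group correlation could be written with base R's `by`. This is only a sketch; the names res and vals are illustrative.

```r
data("mtcars")

# by() splits mtcars on cyl and applies the function to each subset
res <- by(mtcars, mtcars$cyl, function(X) cor(X$hp, X$wt))
res

# as.vector() strips the "by" class and dimensions, leaving plain numbers
vals <- as.vector(res)
vals
```

The printed `by` object labels each group, while as.vector drops those labels, so keep res itself if the group names matter.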
