6

I have a data frame made by row binding many data frames, each identified with a unique key. I wish to calculate the correlation coefficients for columns in each subset (using the unique key) of the big data frame. For example, using the mtcars data I might want to calculate the correlation between columns hp and wt for each unique value in column cyl. I could do it in a loop

data("mtcars")
for(i in c(4,6,8)){
temp = subset(mtcars,mtcars$cyl==i)
cor(temp$hp,temp$wt)
}

I think aggregate would be better, but this code doesn't work:

data("mtcars")
aggregate(mtcars,by=mycars$cyl,cor)
Alex
  • 15,186
  • 15
  • 73
  • 127

2 Answers2

9

You could use

 data("mtcars")
 library(plyr)
 ddply(mtcars, "cyl", function(x) cor(x$hp, x$wt))

This splits the data in mtcars by cyl, applies for each subset x the function cor(x$hp, x$wt) and then aggregates the results for each of the subsets in a data.frame.

I can highly recommend the plyr package. It's one of the packages I use most in R.


Edit: As per request, here a dplyr version. I have to say that I am not a big dplyr user, but the code should be ok.

library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp, wt))
cryo111
  • 4,444
  • 1
  • 15
  • 37
9

In base R, it's job for split and lapply or sapply

lapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
## $`4`
## [1] 0.1598761
## 
## $`6`
## [1] -0.3062284
## 
## $`8`
## [1] 0.01761795
## 


sapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
##           4           6           8 
##  0.15987614 -0.30622844  0.01761795 
CHP
  • 16,981
  • 4
  • 38
  • 57