15

I'm new to R, so I apologize if this is a straightforward question, however I've done quite a bit of searching this evening and can't seem to figure it out. I've got a data frame with a whole slew of variables, and what I'd like to do is create a table of the correlations among a subset of these, basically the equivalent of "pwcorr" in Stata, or "correlations" in SPSS. The one key to this is that not only do I want the r, but I also want the significance associated with that value.

Any ideas? This seems like it should be very simple, but I can't seem to figure out a good way.

Metrics
  • 15,172
  • 7
  • 54
  • 83
Cody
  • 273
  • 1
  • 2
  • 8
  • 1
    @mnel - `?cor` does not give p-values though. `rcorr` from the `Hmisc` library will however do so. – thelatemail Nov 21 '12 at 04:31
  • 1
    @thelatemail `?cor` will give a link to `cor.test` which will (If the OP follows the suggestion, he should find that). – mnel Nov 21 '12 at 04:35
  • 1
    @mnel - though to my knowledge `cor.test` does not operate on a matrix/data.frame simply like `cor` does. – thelatemail Nov 21 '12 at 04:51

4 Answers4

10

Bill Venables offers this solution in this answer from the R mailing list to which I've made some slight modifications:

cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)

  cor.mat <- t(R)
  cor.mat[upper.tri(cor.mat)] <- NA
  cor.mat
}

So let's test it out:

set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)

          [,1]      [,2]      [,3]      [,4] [,5]
[1,] 1.0000000        NA        NA        NA   NA
[2,] 0.7005361 1.0000000        NA        NA   NA
[3,] 0.5990483 0.6816955 1.0000000        NA   NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000   NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835    1

Does that line up with cor.test?

cor.test(data[,2], data[,3])

 Pearson's product-moment correlation
data:  data[, 2] and data[, 3] 
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.3603246  0.5178982 
sample estimates:
       cor 
0.09778865 

Seems to work ok.

sebastian-c
  • 15,057
  • 3
  • 47
  • 93
  • This definitely seems like it'll be a good solution, however, I'm having trouble figuring out how to first construct a matrix out of my data. As I said, I have a data frame with ~400 variables (columns) and 371 subjects (rows), but doing something like cor.prob(matrix(var10, var11)) seems to be giving me correlations between each row in those variables, instead of one coefficient representing the correlation between the two variables. – Cody Nov 21 '12 at 14:01
  • So I've figured out a solution to my last comment—it looks like I should be doing something like this for my data: `cor(cbind(data$var1, data$var2), use="complete.obs")` However, I'm having trouble extending your code to include `use="complete.obs"`. Any ideas on this? – Cody Nov 21 '12 at 20:07
  • @Cody Try using `na.omit` on the data frame to create a new data frame and see if that fixes your issue. – sebastian-c Nov 22 '12 at 00:17
  • This worked well, and is producing the results I was looking for. Thanks again for your help! – Cody Nov 28 '12 at 02:59
5

Here is something that I just made, I stumbled on this post because I was looking for a way to take every pair of variables, and get a tidy nX3 dataframe. Column 1 is a variable, Column 2 is a variable, and Column 3 and 4 are their absolute value and true correlation. Just pass the function a dataframe of numeric and integer values.

  pairwiseCor <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify=FALSE)
  df <- data.frame(Vairable1=rep(0,length(pairs)), Variable2=rep(0,length(pairs)), 
                   AbsCor=rep(0,length(pairs)), Cor=rep(0,length(pairs)))
  for(i in 1:length(pairs)){
    df[i,1] <- pairs[[i]][1]
    df[i,2] <- pairs[[i]][2]
    df[i,3] <- round(abs(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]])),4)
    df[i,4] <- round(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]]),4)
  }
  pairwiseCorDF <- df
  pairwiseCorDF <- pairwiseCorDF[order(pairwiseCorDF$AbsCor, decreasing=TRUE),]
  row.names(pairwiseCorDF) <- 1:length(pairs)
  pairwiseCorDF <<- pairwiseCorDF
  pairwiseCorDF
  }

This is what the output is:

 > head(pairwiseCorDF)
             Vairable1        Variable2 AbsCor     Cor
    1        roll_belt     accel_belt_z 0.9920 -0.9920
    2 gyros_dumbbell_x gyros_dumbbell_z 0.9839 -0.9839
    3        roll_belt total_accel_belt 0.9811  0.9811
    4 total_accel_belt     accel_belt_z 0.9752 -0.9752
    5       pitch_belt     accel_belt_x 0.9658 -0.9658
    6 gyros_dumbbell_z  gyros_forearm_z 0.9491  0.9491
user3728456
  • 203
  • 1
  • 3
  • 7
4

I've found that the R package picante does a nice job dealing with the problem that you have. You can easily pass your dataset to the cor.table function and get a table of correlations and p-values for all of your variables. You can specify Pearson's r or Spearman in the function. See this link for help: http://www.inside-r.org/packages/cran/picante/docs/cor.table

Also remember to remove any non-numeric columns from your dataset prior to running the function. Here's an example piece of code:

install.packages("picante")
library(picante)
#Insert the name of your dataset in the code below
cor.table(dataset, cor.method="pearson")
jscole
  • 66
  • 4
3

You can use the sjt.corr function of the sjPlot-package, which gives you a nicely formatted correlation table, ready for use in your Office application.

Simplest function call is just to pass the data frame:

sjt.corr(df)

See examples here.

Daniel
  • 7,252
  • 6
  • 26
  • 38