
How would you write a function that manually calculates the Pearson correlation in R? I know there is a built-in function called cor, but what if I want to apply the equation below to each combination of columns in a data frame? How would I do it?

[Image: the Pearson correlation equation]

I wish I knew how, but I believe it requires many for loops, nested for loops, and so on, and I am not that strong at programming yet. I hope someone will attempt it so that a newbie like me can learn. Thanks.

Example:

  set.seed(1)
  DF = data.frame(V1 = rnorm(10), V2=rnorm(10), V3=rnorm(10), V4=rnorm(10))

  #     V1    V2    V3    V4
  # V1  1.00 -0.38 -0.72 -0.24
  # V2 -0.38  1.00  0.60  0.18
  # V3 -0.72  0.60  1.00  0.08
  # V4 -0.24  0.18  0.08  1.00
Neal Fultz
janman

2 Answers


First write a helper function to calculate covariance:

v <- function(x,y=x) mean(x*y) - mean(x)*mean(y)

Then use it to calculate correlation:

my_corr <- function(x,y) v(x,y) / sqrt(v(x) * v(y))

Here's a quick check that it works correctly:

> my_corr(DF$V1, DF$V2)
[1] -0.3767034
> cor(DF$V1, DF$V2)
[1] -0.3767034

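Note that calculating correlation this way is numerically unstable: mean(x*y) - mean(x)*mean(y) subtracts two nearly equal numbers whenever the means are large relative to the spread of the data, so precision is lost to cancellation.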
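One common way around the cancellation is to center each vector before multiplying. The following is only a sketch of that idea, not part of the original helper; the name my_corr_centered is made up for illustration, and the shift of 1e6 is an arbitrary choice to mimic data with large means.

```r
# Two-pass Pearson correlation: subtract the means first, then combine.
# `my_corr_centered` is a hypothetical name used only for this sketch.
my_corr_centered <- function(x, y) {
  dx <- x - mean(x)  # deviations of x from its mean
  dy <- y - mean(y)  # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

set.seed(1)
x <- rnorm(10)
y <- rnorm(10)

# Compare against cor() on ordinary data.
all.equal(my_corr_centered(x, y), cor(x, y))

# Compare again after shifting both vectors far from zero (assumed shift: 1e6).
# Correlation does not depend on the shift, so the result should be unchanged.
all.equal(my_corr_centered(x + 1e6, y + 1e6), cor(x, y))
```

The 1/n factors cancel between numerator and denominator, so plain sums are enough here.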

EDIT:

To apply it to all combinations of columns, use outer:

> outer(DF, DF, Vectorize(my_corr))

#     V1    V2    V3    V4
# V1  1.00 -0.38 -0.72 -0.24
# V2 -0.38  1.00  0.60  0.18
# V3 -0.72  0.60  1.00  0.08
# V4 -0.24  0.18  0.08  1.00
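If the function should accept the data frame itself and loop over every pair of columns, as the question describes, a nested for loop is one way to write it. This is a minimal sketch rather than a recommended implementation; corr_matrix is a hypothetical name, and it assumes every column is numeric with no missing values.

```r
# Helpers from above, repeated so the example runs on its own.
v <- function(x, y = x) mean(x * y) - mean(x) * mean(y)
my_corr <- function(x, y) v(x, y) / sqrt(v(x) * v(y))

# Hypothetical wrapper: takes a data frame, returns a correlation matrix.
# Assumes all columns are numeric and contain no NA values.
corr_matrix <- function(df) {
  p <- ncol(df)
  out <- matrix(NA_real_, nrow = p, ncol = p,
                dimnames = list(names(df), names(df)))
  for (i in seq_len(p)) {      # row: first column of the pair
    for (j in seq_len(p)) {    # column: second column of the pair
      out[i, j] <- my_corr(df[[i]], df[[j]])
    }
  }
  out
}

set.seed(1)
DF <- data.frame(V1 = rnorm(10), V2 = rnorm(10), V3 = rnorm(10), V4 = rnorm(10))
round(corr_matrix(DF), 2)
```

Because df[[i]] picks columns by position, the same function works for 4 columns or 100 without naming any of them.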
Boro Dega
  • Doesn't work when I pass in my data frame: my_corr(DF). Remember, the function should work on a data frame, not on vectors. – janman Apr 19 '16 at 22:23
  • But here you are selecting your columns with $; a function should be general enough to find the columns on its own. What if you have 100 columns? How would you write that function? – janman Apr 19 '16 at 22:27

Well, you don't need to do this "manually"; you can just use...

cor(DF)

... which calculates r for all combinations of columns.
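For instance, with the data frame from the question (rebuilt here so the snippet runs on its own), rounding to two decimals should give a table in the same layout as the one the question asks for:

```r
set.seed(1)
DF <- data.frame(V1 = rnorm(10), V2 = rnorm(10), V3 = rnorm(10), V4 = rnorm(10))

# Full correlation matrix, one row and one column per data frame column.
r <- cor(DF)

# Rounded for display.
round(r, 2)

# Pearson is the default method, so naming it explicitly changes nothing.
all.equal(r, cor(DF, method = "pearson"))
```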

lebatsnok