Pearson Correlation "Stratified" by Categorical Variable

Question

I am very new to R.

I am interested in calculating Pearson Correlations for my data. I have successfully figured out how to calculate the correlation of two continuous variables within my data set, x and y; however, I am hoping to "stratify" the correlations by a third, categorical variable: state. I would like to be able to say "the correlation co-efficient/p-value of x and y is [Result] in [State]."

I have tried the group_by method, located in the dplyr package, housed within the cor.test (shown below). I am in need of both the coefficients and the p-values, so I have been trying to use the cor.test method. I have also tried using a matrix method, but was unsuccessful that way as well.

Data<-read.csv("PATHWAYNAME")
  library(dplyr)
  CCor<-cor.test(Data$x, Data$y,
          method=c("pearson"), group_by(State))
  CCor

I am able to run each set of values for each state individually to get the coefficients and p-values; however, I am certain that there is a more efficient way to complete this task. My data is large enough that it will be beyond tedious to run them individually.

Thank you in advance for your help!

UPDATE: Using this as a sample data set that is extremely truncated, but similarly represents the variables in my own, I would like to know if the average income correlates with the number of visits in each state listed; that is, does the average income have a positive or negative correlation with the number of visits in the state of Alabama?

>State  NumVis  AvgIncome
>IN       45        60000
>AL       100       56000
>AK       45        80000
>ME       89        54000
>NC       120       100000
>SC       356       43000
>ND       100       25000
>SD       63        20000
>MN       54        46000
>ID       85        55000

When running this data using the code indicated below, my outcome is this:

 CorrDat<-read.csv("File")
     CorrDat %>%
       group_by(State) %>%
        do(tidy(cor.test(CorrDat$NumVis, CorrDat$Income, method="pearson")))

Results

Would you be able to help clarify what it is that I am doing incorrectly with this code or if I need to use an alternative method to accomplish this task?

Effectively, I'm looking for the equivalent of the Class Statement in SAS, but for R. — user9165024, Jan 02 '18 at 18:50
Its easier to help you if you provide a proper [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data and the desired output for that input. — MrFlick, Jan 02 '18 at 18:51

missuse · Accepted Answer · 2018-01-03T19:47:51.727

There are several ways this can be accomplished in R. dplyr or more generally tidyverse are a popular group of tools that are able to achieve the desired result. The key difference in these tools is the pipe %>% which provides a means to write code from left to right instead of from the inside out (or to make a bunch of intermediary objects in the environment). Although the pipe can be used with base R its popularity came with dplyr.

Here are several examples on mtcars data set. The key functions are do and map which are quite versatile. I suggest running ?do and ?map.

library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  summarize(cor = cor(mpg, disp))
#output
# A tibble: 3 x 2
    cyl correlation
  <dbl>       <dbl>
1     4  -0.8052361
2     6   0.1030827
3     8  -0.5197670

another way is:

mtcars  %>% 
  group_by(cyl) %>%
  do(cor = cor(.$mpg, .$disp)) %>%
  unnest()

or for more variables:

mtcars  %>% 
  group_by(cyl) %>%
  do(cor = as.data.frame(cor(.[,-2]) )) %>%
  unnest()

an example with cor.test:

library(broom)

mtcars  %>% 
  group_by(cyl)  %>% 
  do(tidy(cor.test(.$mpg, .$disp))) 
#output
    cyl   estimate  statistic     p.value parameter   conf.low   conf.high                               method alternative
  <dbl>      <dbl>      <dbl>       <dbl>     <int>      <dbl>       <dbl>                               <fctr>      <fctr>
1     4 -0.8052361 -4.0740206 0.002782827         9 -0.9474526 -0.39724826 Pearson's product-moment correlation   two.sided
2     6  0.1030827  0.2317344 0.825929685         5 -0.7046776  0.79446840 Pearson's product-moment correlation   two.sided
3     8 -0.5197670 -2.1075838 0.056774876        12 -0.8232990  0.01492976 Pearson's product-moment correlation   two.sided

and yet another way using purrr::map:

mtcars  %>% 
  split(.$cyl)  %>% 
  map(~cor.test(.x$mpg, .x$disp))

which gives a list, which can be manipulated with the same or another map function:

mtcars  %>% 
  split(.$cyl)  %>% 
  map(~cor.test(.x$mpg, .x$disp)) %>%
  map_dbl("p.value")
#output:
          4           6           8 
0.002782827 0.825929685 0.056774876

to extract the coefficients:

mtcars  %>% 
  split(.$cyl)  %>% 
  map(~cor.test(.x$mpg, .x$disp)) %>%
  map(~data.frame(cor = .x$estimate, p = .x$p.value)) #check also `map_dfr` and `map_dfc`

#output
$`4`
           cor           p
cor -0.8052361 0.002782827

$`6`
          cor         p
cor 0.1030827 0.8259297

$`8`
          cor          p
cor -0.519767 0.05677488

UPDATE: answer to the updated question:

The problem is how you specify the do call. This is correct:

df %>%
  group_by(State) %>%
  do(tidy(cor.test(.$NumVis, .$AvgIncome, method="pearson")))

where the . represents the data passed by the previous pipe. In the example posted this results in:

Error in cor.test.default(.$NumVis, .$AvgIncome, method = "pearson") : 
not enough finite observations

which is reasonable considering only 1 observation per group is suplied

what you did is:

CorrDat<-read.csv("File")
     CorrDat %>%
       group_by(State) %>%
        do(tidy(cor.test(CorrDat$NumVis, CorrDat$Income, method="pearson")))

passing the whole CorrDat set to the do function so it performed the same operation as many times as there are groups.

The %>% pipe assumes the data passed will be used as the first argument in the following function, if it should not the data can be refereed to as .. You can perform operations like .$column or .[,2] etc.

@user9165024 updated the post. The error was spotted and explained. If anything is unclear please let me know. Usually when clarification is achieved the comments can be deleted. I deleted mine. — missuse, Jan 03 '18 at 19:48
Would you be able to further clarify the first error you received? I am getting that error when I run my data: Error in cor.test.default(.$NumVis, .$AvgIncome, method = "pearson") : not enough finite observations. I made the alterations to the code as you suggested! Thank you for your thorough explanation! — user9165024, Jan 04 '18 at 14:16
@user9165024 If any of the groups has 2 or less observations this error will occur. Try: `cor.test(1:2, rnorm(2))` and `cor.test(1:3, rnorm(3))`. A strait line can be defined by two points so there is no point calculating correlation, It will be 1. Remove groups that have less than 3 observations and run the code again. — missuse, Jan 04 '18 at 18:41

score 0 · Answer 2 · answered Jan 02 '18 at 19:59

With base r, you could use by.

E.g., replicating one of the examples in missuse's post:

do.call(rbind,
        by(mtcars, mtcars$cyl, FUN = function(x) cor.test(x$mpg, x$disp, data = x)))

statistic parameter p.value     estimate   null.value alternative method                                 data.name          conf.int 
4 -4.074021 9         0.002782827 -0.8052361 0          "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
6 0.2317344 5         0.8259297   0.1030827  0          "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
8 -2.107584 12        0.05677488  -0.519767  0          "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2

Pearson Correlation "Stratified" by Categorical Variable

2 Answers2