Correlations between numerous variables grouped in dplyr

Question

Say I have a data frame, like this:

# Set RNG seed
set.seed(33550336)

# Create dummy data frame
df <- data.frame(PC1 = runif(20),
                 PC2 = runif(20),
                 PC3 = runif(20),
                 A = runif(20),
                 B = runif(20),
                 loc = sample(LETTERS[1:2], 20, replace = TRUE),
                 seas = sample(c("W", "S"), 20, replace = TRUE))

# > head(df)
#         PC1        PC2       PC3         A         B loc seas
# 1 0.8636470 0.02220823 0.7553348 0.4679607 0.0787467   A    S
# 2 0.3522257 0.42733152 0.2412971 0.6691419 0.1194121   A    W
# 3 0.5257408 0.44293320 0.3225228 0.0934192 0.2966507   B    S
# 4 0.0667227 0.90273594 0.6297959 0.1962124 0.4894373   A    W
# 5 0.3751383 0.50477920 0.6567203 0.4510632 0.4742191   B    S
# 6 0.9197086 0.32024904 0.8382138 0.9907894 0.9335657   A    S

I'm interested in calculating correlations between PC1, PC2, and PC3 and each of the variables A and B grouped by loc and seas. So, for example, based on this answer, I could do the following:

# Correlation of variable A and PC1 per loc & seas combination
df %>% 
  group_by(loc, seas) %>% 
  summarise(cor = cor(PC1, A)) %>% 
  ungroup

# # A tibble: 4 x 3
#   loc   seas      cor
#   <fct> <fct>   <dbl>
# 1 A     S      0.458 
# 2 A     W      0.748 
# 3 B     S     -0.0178
# 4 B     W     -0.450

This gives me what I want: the correlation between PC1 and A for each combination of loc and seas. Awesome.

What I'm struggling with is extrapolating this to perform the calculation for each combination of PC* variables and other variables (i.e., A and B, in the example). My expected outcome is the tibble immediately above, but with a column for each combination for PC* and the other variables. I could do this long hand... cor(PC2, A), cor(PC3, A), cor(PC1, B), etc., but presumably there is there a succinct way of coding the calculation. I suspect it involves do, but I can't quite get my head around it... Can someone enlighten me?

Solution

I went with G. Grothendieck's solution below, but this required some restructuring to get it into the required format. I have posted the code I used here in case it is useful for others.

# Perform calculation
res <- by(df[1:5], df[-(1:5)], cor)

# Combinations of loc & seas 
comb <- expand.grid(dimnames(res))

#   loc seas
# 1   A    S
# 2   B    S
# 3   A    W
# 4   B    W

# A matrix corresponding to a loc & seas
# Plus the loc & seas themselves
restructure <- function(m, n){
  # Convert to data frame
  # Add rownames as column
  # Retains PCs as rows, but not columns
  # Gather variables to long format
  # Unite PC & variable names
  # Spread to a single row
  # Add combination of loc & seas
  m %>% 
    data.frame %>% 
    rownames_to_column() %>% 
    filter(grepl("PC", rownames(m))) %>% 
    select(-contains("PC")) %>% 
    gather(variable, value, -rowname) %>% 
    unite(comb, rowname, variable) %>% 
    spread(comb, value) %>% 
    bind_cols(n)
}

# Restructure each list element & combine into data frame
do.call(rbind, lapply(1:length(res), function(x)restructure(res[[x]], comb[x, ])))

which gives,

#         PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B loc seas
# 1  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403   A    S
# 2 -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262   B    S
# 3  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690   A    W
# 4 -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706   B    W

I m not sure but those seem to be eigenvectors :) – Sotos Dec 13 '18 at 15:18 — Sotos, Dec 13 '18 at 15:18

G. Grothendieck · Accepted Answer · 2018-12-14T20:29:31.083

Use by like this:

By <- by(df[1:5], df[-(1:5)], cor)

giving:

> By
loc: A
seas: S
            PC1        PC2        PC3          A           B
PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: S
            PC1         PC2         PC3           A          B
PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: A
seas: W
           PC1         PC2        PC3          A           B
PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: W
           PC1        PC2        PC3          A          B
PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000

ADDED

Based on further discussion by poster on what is wanted define the onerow function which accepts a correlation matrix or a data frame (in the latter case it converts the first 5 columns to a correlatoin matrix) producing one row of the output. The if statement in onerow is not needed, but won't hurt, for the adply line of code but we have included it so that onerow also works in a simple manner in subsequent examples below as well.

library(plyr)

onerow <- function(x) {
  if (is.data.frame(x)) x <- cor(x[1:5])
  dtab <- as.data.frame.table(x[4:5, 1:3])
  with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
}

adply(By, 1:2, onerow)

giving:

  loc seas       PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B
1   A    S  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403
2   B    S -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262
3   A    W  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690
4   B    W -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706

or perhaps get rid of by altogether and use this giving the same output:

library(plyr)
ddply(df, -(1:5), onerow)

or using dplyr:

library(dplyr)
df %>%
  group_by_at(-(1:5)) %>%
  do( onerow(.) %>% t %>% as.data.frame ) %>%
  ungroup

Thanks! I went with this solution, but have added the code used to restructure these results into the required format to the question. — Dan, Dec 13 '18 at 16:29

akrun · Answer 2 · 2018-12-13T15:46:59.037

We can do a split and cor in base R

lapply(split(df[1:5], df[-(1:5)]), cor)
#$A.S
#            PC1        PC2        PC3          A           B
#PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
#PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
#PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
#A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
#B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000

#$B.S
#            PC1         PC2         PC3           A          B
#PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
#PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
#PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
#A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
#B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000

#$A.W
#           PC1         PC2        PC3          A           B
#PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
#PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
#PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
#A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
#B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000

#$B.W
#           PC1        PC2        PC3          A          B
#PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
#PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
#PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
#A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
#B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000

Or using tidyverse

library(tidyverse)
df %>% 
    group_by_at(6:7) %>% 
    nest %>% 
    mutate(data = map(data, cor))

Sotos · Answer 3 · 2018-12-13T15:47:23.407

Here is a solution via tidyverse where we use summarise_at to specify all PC[0-9] and correlate them with A. Same procedure for B, and then simply merge, i.e.

library(tidyverse)

df %>% 
 group_by(loc, seas) %>% 
 summarise_at(vars(starts_with('PC')), funs(cor(., A))) %>% 
 left_join(., df %>% 
                 group_by(loc, seas) %>% 
                 summarise_at(vars(starts_with('PC')), funs(cor(., B))), 
          by = c('loc', 'seas'), suffix = c('.A', '.B'))

which gives,

# A tibble: 4 x 8
# Groups:   loc [?]
  loc   seas    PC1.A  PC2.A  PC3.A    PC1.B   PC2.B PC3.B
  <fct> <fct>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl> <dbl>
1 A     S      0.458   0.352 -0.200 -0.00925  0.209  0.374
2 A     W      0.748  -0.399 -0.590  0.496   -0.0523 0.361
3 B     S     -0.0178 -0.350  0.803 -0.743    0.463  0.458
4 B     W     -0.450  -0.990  0.761 -0.667   -0.810  0.374

I like this solution because it gives the result in the required format, but can it be generalised to an arbitrary number of variables without having to add another `left_join...` line for each? — Dan, Dec 13 '18 at 16:33
It can. We will have to loop over the columns `A`, `B`, ... I can give it a shot tomorrow. — Sotos, Dec 13 '18 at 16:59

Correlations between numerous variables grouped in dplyr

Solution

3 Answers3

ADDED