Between-Within Correlations

Question

DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
                  "TYPE" = rep(1:2, 400),
                  "TIME" = rep(100:101, 400),
                  "SCORE" = sample(1:100,r=T,800))

Cheers all,

I have 'DATA' and wish to estimation the CORRELATION VALUES of SCORE at each TIME and SCORE and TYPE combination BETWEEN AND WITHIN GROUP in this way:

(1) You're using random data, and the results are not perfectly clear, please start with `set.seed(42)` (or some number) so that we are all using the same random data for reproducibility. (2) What is your expected output? Given this structure, how many rows are you expecting? — r2evans, Apr 29 '22 at 17:19

score 0 · Answer 1 · answered Apr 29 '22 at 18:14

I am assuming you want to compute the correlation between groups 1-2, 1-3, 1-4 and so on for each combination of TIME and TYPE. Here's an approach:

# create the dataset
set.seed(123)
df <- data.frame("group" = sort(rep(1:4, 200)),
                 "type" = rep(1:2, 400),
                 "time" = rep(100:101, 400),
                 "score" = sample(1:100,r=T,800))
library(tidyr)
library(purrr)
library(data.table)

# another dataset to filter combinations 
# (G1G2 is same G2G1, so remove G2G1)
df2 <- combn(4, 2) %>% t %>% 
    as_tibble() %>% 
    rename(group1 = V1, group2 = V2) %>% 
    mutate(value = TRUE)

df %>% 
    
    # add identifiers per group
    group_by(time, type, group) %>% 
    mutate(id = row_number()) %>% 
    ungroup() %>% 
    
    # nest data to get separate tibble for each
    # combination of time and type
    nest(data = -c(time, type)) %>% 
    
    # convert each data.frame to data.table
    mutate(dt = map(data, function(dt){
        setDT(dt)
        setkey(dt, id)
        dt
    })) %>% 
    
    # correlation between groups in R
    # refer answer below for more details
    # https://stackoverflow.com/a/26357667/15221658
    
    # cartesian join of dts
    mutate(dtj = map(dt, ~.[., allow.cartesian = TRUE])) %>% 
    
    # compute between group correlation
    mutate(cors = map(dtj, ~.[, list(cors = cor(score, i.score)), by = list(group, i.group)])) %>% 
    
    # unnest correlation object
    unnest(cors) %>% 
    
    # formatting for display
    select(type, time, group1 = group, group2 = i.group, correlation = cors) %>% 
    filter(group1 != group2) %>%
    arrange(time, group1, group2) %>% 

    # now use df2 since currently we have G1G2, and G2G1
    # which are both equal so remove G2G1
    left_join(df2, by = c("group1", "group2")) %>% 
    filter(value) %>% 
    select(-value)

# A tibble: 12 x 5
    type  time group1 group2 correlation
   <int> <int>  <int>  <int>       <dbl>
 1     1   100      1      2     0.121  
 2     1   100      1      3     0.0543 
 3     1   100      1      4    -0.0694 
 4     1   100      2      3    -0.104  
 5     1   100      2      4    -0.0479 
 6     1   100      3      4    -0.0365 
 7     2   101      1      2    -0.181  
 8     2   101      1      3    -0.0673 
 9     2   101      1      4     0.00765
10     2   101      2      3     0.0904 
11     2   101      2      4    -0.0126 
12     2   101      3      4    -0.154

Uwe · Answer 2 · 2022-05-04T21:35:49.490

Here is an alternative approach which creates all unique combinations of TIME, TYPE, and duplicated GROUPs through a cross join and then computes the correlation of SCORE for the correspondings subsets of DATA:

library(data.table) # development version 1.14.3 required
setDT(DATA, key = c("GROUP", "TYPE", "TIME"))[
  , CJ(time = TIME, type = TYPE, groupA = GROUP, groupB = GROUP, unique = TRUE)][
    groupA < groupB][
      , corType := paste0("G", groupA, "G", groupB)][][
        , corValue := cor(DATA[.(groupA, type, time), SCORE], 
                          DATA[.(groupB, type, time), SCORE]), 
        by = .I][]

    time type groupA groupB corType    corValue
 1:  100    1      1      2    G1G2  0.11523940
 2:  100    1      1      3    G1G3 -0.05124326
 3:  100    1      1      4    G1G4 -0.16943203
 4:  100    1      2      3    G2G3  0.05475435
 5:  100    1      2      4    G2G4 -0.10769738
 6:  100    1      3      4    G3G4  0.01464146
 7:  100    2      1      2    G1G2          NA
 8:  100    2      1      3    G1G3          NA
 9:  100    2      1      4    G1G4          NA
10:  100    2      2      3    G2G3          NA
11:  100    2      2      4    G2G4          NA
12:  100    2      3      4    G3G4          NA
13:  101    1      1      2    G1G2          NA
14:  101    1      1      3    G1G3          NA
15:  101    1      1      4    G1G4          NA
16:  101    1      2      3    G2G3          NA
17:  101    1      2      4    G2G4          NA
18:  101    1      3      4    G3G4          NA
19:  101    2      1      2    G1G2 -0.04997479
20:  101    2      1      3    G1G3 -0.02262932
21:  101    2      1      4    G1G4 -0.00331578
22:  101    2      2      3    G2G3 -0.01243952
23:  101    2      2      4    G2G4  0.16683223
24:  101    2      3      4    G3G4 -0.10556083
    time type groupA groupB corType    corValue

Explanation

DATA is coerced to class data.table while setting a key on columns GROUP, TYPE, and TIME. Keying is required for fast subsetting later.
The cross join CJ() creates all unique combinations of columns TIME, TYPE, GROUP, and GROUP (twice). The columns of the cross join have been renamed to avoid name clashes later on.
[groupA < groupB] ensures that equivalent combinations of groupA and groupB only appear once, e.g., G2G1 is dropped in favour of G1G2. So, this is kind of data.table version of t(combn(unique(DATA$GROUP), 2)).
A new column corType is append by reference.
Finally, the groupwise correlations are computed by stepping rowwise through the cross join table (using by = .I) and subsetting DATA by groupA, type, time and groupB, type, time, resp., using fast subsetting through keys. Please, see the vignette Keys and fast binary search based subset for more details.
Note that by = .I is a new feature of data.table development version 1.14.3.

Combinations of time, type, and group which do not exist in DATA will appear in the result set but are marked by NA in column corValue.

Data

set.seed(42) # required for reproducible data
DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
                  "TYPE" = rep(1:2, 400),
                  "TIME" = rep(100:101, 400),
                  "SCORE" = sample(1:100, r=T, 800))

Between-Within Correlations

2 Answers2

Explanation

Data