
I am trying to compute a correlation matrix for a data frame (named "df") that contains both numeric and boolean (TRUE/FALSE) variables and has some missing values.

df looks like this:

df <- data.frame(
  idcode = 1:10,
  contract = c("TRUE", "FALSE", "FALSE", "FALSE", NA, NA, "TRUE", "TRUE",
               "FALSE", "TRUE"),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c("FALSE", NA, "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE",
          "TRUE", "TRUE"))

I have found two similar alternatives to compute this, but they give me different results:

data.matrix(df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)

and

model.matrix(~0 + ., data = df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)

Could someone please explain to me why the two correlation matrices differ, and which method is the correct one for computing the correlation matrix in this case?

For example, why does the correlation for the pair CEO-score differ?

Ludovico
  • Your edits improve the question, but we can't run your code: we don't have `df`. Could you include code to produce `df`, and show the differences in the resulting matrices? – user2554330 Sep 14 '21 at 11:58

1 Answer


The two functions model.matrix and data.matrix behave differently in several ways, including what happens if there are NA values, and how non-numeric variables are handled. See the help pages.

By default, model.matrix deletes every row that contains an NA. data.matrix keeps these rows, and their non-missing values still contribute to cor(use = "pairwise.complete.obs") unless the entire row is NA. This explains the different correlation coefficients.

If you have to use model.matrix, you can set the na.action option so that NA values are passed through (see solution here) and then let cor(use = "pairwise.complete.obs") handle them.
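As a minimal sketch of passing NA values through without touching the global options, the model frame can be built with na.action = na.pass and then handed to model.matrix. The data frame `toy` and the object names below are hypothetical and only serve as illustration; the sketch assumes the session still uses the default na.action ("na.omit").

```r
# Hypothetical toy data (not the question's df): one NA in a logical column
toy <- data.frame(
  x    = c(1, 2, 3, 4, 5),
  flag = c(TRUE, NA, FALSE, TRUE, FALSE)
)

# Default na.action ("na.omit" assumed): the row containing NA is dropped
mm_default <- model.matrix(~ 0 + ., data = toy)

# Apply na.pass to this model frame only; global options stay untouched
mf      <- model.frame(~ ., data = toy, na.action = na.pass)
mm_pass <- model.matrix(~ 0 + ., data = mf)

nrow(mm_default)  # 4 rows: the incomplete row is gone
nrow(mm_pass)     # 5 rows: the incomplete row is kept, with NA entries
```

The resulting matrix can then be passed to cor(use = "pairwise.complete.obs") as in the question, and there is no global option to reset afterwards.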

Get data

library(tidyverse)

df <- data.frame(
  idcode = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

Note that logical values should be written without quotation marks (TRUE, not "TRUE"), but the results look the same here.
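To illustrate why quoting does not change the correlations, here is a small sketch on hypothetical data (the data frames `chr` and `lgl` are made up for this example). As far as I know, data.matrix turns character columns into factor codes (1, 2) and logical columns into integers (0, 1); the two encodings differ only by a constant shift, which leaves correlations unchanged.

```r
# Hypothetical data: the same values, once as character and once as logical
chr <- data.frame(flag = c("TRUE", "FALSE", NA, "TRUE"), x = c(1, 4, 2, 3))
lgl <- data.frame(flag = c(TRUE, FALSE, NA, TRUE), x = c(1, 4, 2, 3))

data.matrix(chr)[, "flag"]  # factor codes: "FALSE" -> 1, "TRUE" -> 2
data.matrix(lgl)[, "flag"]  # integers:      FALSE  -> 0,  TRUE  -> 1

# Correlation is invariant to shifting a variable by a constant
cor_chr <- cor(data.matrix(chr), use = "pairwise.complete.obs")
cor_lgl <- cor(data.matrix(lgl), use = "pairwise.complete.obs")
all.equal(cor_chr, cor_lgl)
```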

Default behaviour of model.matrix

If there are NA values, model.matrix drops each affected row entirely, while data.matrix keeps it. This is due to the default options()$na.action, which is set to na.omit and only affects model.matrix.

options()$na.action
#[1] "na.omit"

model.matrix(~0 + ., data = df)
#>    idcode contractFALSE contractTRUE score CEOTRUE
#> 1       1             0            1  1.17       0
#> 3       3             1            0  7.20       1
#> 4       4             1            0  6.60       1
#> 7       7             0            1  7.20       1
#> 8       8             0            1  9.10       1
#> 9       9             1            0  5.40       1
#> 10     10             0            1  2.21       1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#> 
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"

data.matrix(df)
#>       idcode contract score CEO
#>  [1,]      1        2  1.17   1
#>  [2,]      2        1  5.00  NA
#>  [3,]      3        1  7.20   2
#>  [4,]      4        1  6.60   2
#>  [5,]      5       NA  3.00   2
#>  [6,]      6       NA  3.80   2
#>  [7,]      7        2  7.20   2
#>  [8,]      8        2  9.10   2
#>  [9,]      9        1  5.40   2
#> [10,]     10        2  2.21   2

Behaviour with na.action = "na.pass"

# set na.action options
oldpar <- options()$na.action
options(na.action ="na.pass")

model.matrix(~0 + ., data = df)
#>    idcode contractFALSE contractTRUE score CEOTRUE
#> 1       1             0            1  1.17       0
#> 2       2             1            0  5.00      NA
#> 3       3             1            0  7.20       1
#> 4       4             1            0  6.60       1
#> 5       5            NA           NA  3.00       1
#> 6       6            NA           NA  3.80       1
#> 7       7             0            1  7.20       1
#> 8       8             0            1  9.10       1
#> 9       9             1            0  5.40       1
#> 10     10             0            1  2.21       1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#> 
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"

data.matrix(df)
#>       idcode contract score CEO
#>  [1,]      1        2  1.17   1
#>  [2,]      2        1  5.00  NA
#>  [3,]      3        1  7.20   2
#>  [4,]      4        1  6.60   2
#>  [5,]      5       NA  3.00   2
#>  [6,]      6       NA  3.80   2
#>  [7,]      7        2  7.20   2
#>  [8,]      8        2  9.10   2
#>  [9,]      9        1  5.40   2
#> [10,]     10        2  2.21   2

Compare correlation coefficients

data.matrix(df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)
#>          idcode contract  score    CEO
#> idcode    1.000    0.312  0.177  0.625
#> contract  0.312    1.000 -0.226 -0.354
#> score     0.177   -0.226  1.000  0.548
#> CEO       0.625   -0.354  0.548  1.000

model.matrix(~0 + ., data = df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)
#>               idcode contractFALSE contractTRUE  score CEOTRUE
#> idcode         1.000        -0.312        0.312  0.177   0.625
#> contractFALSE -0.312         1.000       -1.000  0.226   0.354
#> contractTRUE   0.312        -1.000        1.000 -0.226  -0.354
#> score          0.177         0.226       -0.226  1.000   0.548
#> CEOTRUE        0.625         0.354       -0.354  0.548   1.000

Note that the two functions handle logical variables differently, which is reflected in the correlation matrices: model.matrix creates two dummy variables for contract and one dummy variable for CEO (see the discussion in the comments on this answer), whereas data.matrix creates a single binary integer variable per column.
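The uneven dummy coding appears to come from the missing intercept: as far as I understand, with ~0 + . the first factor-like term gets one indicator column per level, while later ones get treatment contrasts (reference level dropped). The sketch below shows this on the data from this answer, and then one possible way to sidestep dummy coding by converting logicals to 0/1 before calling cor. The names `df_num` and `is_lgl` are only illustrative.

```r
# Data as in this answer (logicals written without quotes)
df <- data.frame(
  idcode   = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score    = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO      = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# No intercept: contract gets both levels, CEO only the TRUE column
colnames(model.matrix(~ 0 + ., data = df))
# "idcode" "contractFALSE" "contractTRUE" "score" "CEOTRUE"

# With an intercept: each logical gets a single TRUE column
colnames(model.matrix(~ ., data = df))
# "(Intercept)" "idcode" "contractTRUE" "score" "CEOTRUE"

# Alternative sketch: convert logicals to 0/1 and correlate directly
df_num <- df
is_lgl <- vapply(df_num, is.logical, logical(1))
df_num[is_lgl] <- lapply(df_num[is_lgl], as.numeric)

cor_num <- round(cor(df_num, use = "pairwise.complete.obs"), digits = 3)
cor_num["score", "CEO"]  # 0.548, matching the matrices above
```

With this conversion every variable keeps a single column, so the matrix has the same layout as the data.matrix result.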

Reset default options

options(na.action = oldpar)

Session Info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.33      magrittr_2.0.1  rlang_0.4.11    fastmap_1.1.0  
#>  [5] fansi_0.5.0     stringr_1.4.0   styler_1.5.1    highr_0.9      
#>  [9] tools_4.1.1     xfun_0.25       utf8_1.2.2      withr_2.4.2    
#> [13] htmltools_0.5.2 ellipsis_0.3.2  yaml_2.2.1      digest_0.6.27  
#> [17] tibble_3.1.4    lifecycle_1.0.0 crayon_1.4.1    purrr_0.3.4    
#> [21] vctrs_0.3.8     fs_1.5.0        glue_1.4.2      evaluate_0.14  
#> [25] rmarkdown_2.10  reprex_2.0.1    stringi_1.7.4   compiler_4.1.1 
#> [29] pillar_1.6.2    backports_1.2.1 pkgconfig_2.0.3

Created on 2021-09-19 by the reprex package (v2.0.1)

scrameri
  • You are treating the columns as character rather than logical. Fixing that should make the two give the same results – Onyambu Sep 19 '21 at 22:59
  • Hi and thanks! I just changed them to logicals, but the two functions still treat them differently (model.matrix creates two dummy variables from one logical variable, while data.matrix treats a logical as a binary). – scrameri Sep 20 '21 at 09:07
  • Check again. When you have a logical variable, it is already 0/1, so no two dummy variables will be created. For example, why do you have CEO TRUE but not CEO FALSE, yet you have both contract TRUE and contract FALSE columns? – Onyambu Sep 20 '21 at 15:27
  • Hi Onyambu. I see now why it's weird, one logical (contract) gives two dummy variables while the other (CEO) doesn't. But I checked again, with the same results (it's a reprex example after all). I don't know why this happens, any idea? Maybe you could check it again? – scrameri Sep 20 '21 at 16:03