The two functions model.matrix and data.matrix behave differently in several ways, including what happens if there are NA values and how non-numeric variables are handled. See the help pages.
By default, model.matrix deletes every row that contains an NA. data.matrix keeps these rows, and as long as a row is not entirely NA, its non-missing values still contribute to the pairwise complete observations used by cor(use = "pairwise.complete.obs"). This explains the different correlation coefficients.
If you have to use model.matrix, you could set the na.action option to pass NA values through (see solution here) and let cor(use = "pairwise.complete.obs") handle them.
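One way to get the same effect without changing the global option is to build the model frame yourself with na.action = na.pass and hand that to model.matrix. This is a minimal sketch, not necessarily what the linked solution does; the data frame repeats the example data from the next section so that the block runs on its own, and the names mf and mm are only illustrative.

```r
# Example data (same values as in the "Get data" section)
df <- data.frame(
  idcode   = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score    = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO      = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# Build the model frame explicitly so that rows containing NA are kept
mf <- model.frame(~ ., data = df, na.action = na.pass)

# model.matrix() takes an existing model frame as is, so no rows are dropped
mm <- model.matrix(~ 0 + ., data = mf)
nrow(mm)

# Let cor() deal with the remaining NA values pair by pair
round(cor(mm, use = "pairwise.complete.obs"), digits = 3)
```

Because nothing global is modified, there is no option to reset afterwards.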
Get data
library(tidyverse)
df <- data.frame(
  idcode = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)
Note that logical values should be written without quotes (TRUE, not "TRUE"), but the results will look the same here.
Default behaviour of model.matrix
If there are NA values, model.matrix drops the entire row, while data.matrix keeps it. This is due to the default options()$na.action, which is set to na.omit and which only affects model.matrix.
options()$na.action
#[1] "na.omit"
model.matrix(~0 + ., data = df)
#>    idcode contractFALSE contractTRUE score CEOTRUE
#> 1       1             0            1  1.17       0
#> 3       3             1            0  7.20       1
#> 4       4             1            0  6.60       1
#> 7       7             0            1  7.20       1
#> 8       8             0            1  9.10       1
#> 9       9             1            0  5.40       1
#> 10     10             0            1  2.21       1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"
data.matrix(df)
#>       idcode contract score CEO
#>  [1,]      1        2  1.17   1
#>  [2,]      2        1  5.00  NA
#>  [3,]      3        1  7.20   2
#>  [4,]      4        1  6.60   2
#>  [5,]      5       NA  3.00   2
#>  [6,]      6       NA  3.80   2
#>  [7,]      7        2  7.20   2
#>  [8,]      8        2  9.10   2
#>  [9,]      9        1  5.40   2
#> [10,]     10        2  2.21   2
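To see which rows are affected, you can check for complete cases directly. A small sketch using the same example data (repeated so the block runs on its own); the variable names dropped, n_kept and n_all are only illustrative.

```r
# Example data (same values as in the "Get data" section)
df <- data.frame(
  idcode   = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score    = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO      = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# Rows with at least one NA: these are the rows removed under na.omit
dropped <- which(!complete.cases(df))
dropped
#> [1] 2 5 6

# Rows that survive na.omit vs. rows kept by data.matrix()
n_kept <- sum(complete.cases(df))
n_all  <- nrow(data.matrix(df))
c(kept = n_kept, all = n_all)
#> kept  all 
#>    7   10
```

Rows 2, 5 and 6 are exactly the ones missing from the model.matrix output above.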
Behaviour with na.action = "na.pass"
# set na.action options
oldpar <- options()$na.action
options(na.action ="na.pass")
model.matrix(~0 + ., data = df)
#>    idcode contractFALSE contractTRUE score CEOTRUE
#> 1       1             0            1  1.17       0
#> 2       2             1            0  5.00      NA
#> 3       3             1            0  7.20       1
#> 4       4             1            0  6.60       1
#> 5       5            NA           NA  3.00       1
#> 6       6            NA           NA  3.80       1
#> 7       7             0            1  7.20       1
#> 8       8             0            1  9.10       1
#> 9       9             1            0  5.40       1
#> 10     10             0            1  2.21       1
#> attr(,"assign")
#> [1] 1 2 2 3 4
#> attr(,"contrasts")
#> attr(,"contrasts")$contract
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$CEO
#> [1] "contr.treatment"
data.matrix(df)
#>       idcode contract score CEO
#>  [1,]      1        2  1.17   1
#>  [2,]      2        1  5.00  NA
#>  [3,]      3        1  7.20   2
#>  [4,]      4        1  6.60   2
#>  [5,]      5       NA  3.00   2
#>  [6,]      6       NA  3.80   2
#>  [7,]      7        2  7.20   2
#>  [8,]      8        2  9.10   2
#>  [9,]      9        1  5.40   2
#> [10,]     10        2  2.21   2
Compare correlation coefficients
data.matrix(df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)
#>          idcode contract  score    CEO
#> idcode    1.000    0.312  0.177  0.625
#> contract  0.312    1.000 -0.226 -0.354
#> score     0.177   -0.226  1.000  0.548
#> CEO       0.625   -0.354  0.548  1.000
model.matrix(~ 0 + ., data = df) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 3)
#>               idcode contractFALSE contractTRUE  score CEOTRUE
#> idcode         1.000        -0.312        0.312  0.177   0.625
#> contractFALSE -0.312         1.000       -1.000  0.226   0.354
#> contractTRUE   0.312        -1.000        1.000 -0.226  -0.354
#> score          0.177         0.226       -0.226  1.000   0.548
#> CEOTRUE        0.625         0.354       -0.354  0.548   1.000
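With use = "pairwise.complete.obs", each coefficient is computed from the rows where both variables are observed, so different cells of the matrix can rest on different numbers of rows. A short sketch that counts those rows for the example data (repeated so the block runs on its own); the names ok and n_pairs are only illustrative.

```r
# Example data (same values as in the "Get data" section)
df <- data.frame(
  idcode   = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score    = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO      = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

m  <- data.matrix(df)
ok <- !is.na(m)                # TRUE where a value is present

# Entry [i, j] is the number of rows in which both column i and column j
# are non-missing, i.e. the rows available for that pairwise correlation
n_pairs <- crossprod(ok * 1)
n_pairs
```

For this data, the contract/CEO coefficient is based on 7 rows, contract/score on 8, and idcode/score on all 10.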
Note that the two functions handle logical variables differently: model.matrix creates two dummy variables for contract and one dummy variable for CEO (see the discussion in the comments section to this Answer), whereas data.matrix creates a single binary integer variable per column. This is reflected in the correlation matrices.
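The extra contractFALSE column comes from removing the intercept with ~ 0 + .: without an intercept, R codes the first factor-like variable with a full set of indicator columns. If you want one dummy per logical variable, one option is to keep the intercept in the formula and drop the constant column afterwards. This is a sketch under that assumption, using the example data (repeated so the block runs on its own) and a model frame built with na.action = na.pass so that no rows are lost.

```r
# Example data (same values as in the "Get data" section)
df <- data.frame(
  idcode   = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score    = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO      = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# Keep rows with NA by building the model frame explicitly
mf <- model.frame(~ ., data = df, na.action = na.pass)

# With an intercept, each logical variable gets a single dummy column
mm <- model.matrix(~ ., data = mf)
mm <- mm[, colnames(mm) != "(Intercept)"]  # drop the constant column
colnames(mm)

# The pairwise correlations now line up with those from data.matrix()
cor_mm <- cor(mm, use = "pairwise.complete.obs")
cor_dm <- cor(data.matrix(df), use = "pairwise.complete.obs")
same <- isTRUE(all.equal(cor_mm, cor_dm, check.attributes = FALSE))
same
```

The two matrices differ only in their column names (contractTRUE and CEOTRUE instead of contract and CEO).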
Reset default options
options(na.action = oldpar)
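If you would rather not remember to reset the option by hand, you can wrap the call in a small function that restores it automatically. This is a sketch only: model_matrix_keep_na is a hypothetical helper name, not part of any package, and the example data is repeated so the block runs on its own.

```r
# Hypothetical helper: run model.matrix() with na.action = "na.pass" and
# restore the previous option when the function exits, even after an error
model_matrix_keep_na <- function(formula, data) {
  old <- options(na.action = "na.pass")  # options() returns the previous value
  on.exit(options(old), add = TRUE)
  model.matrix(formula, data = data)
}

# Example data (same values as in the "Get data" section)
df <- data.frame(
  idcode   = 1:10,
  contract = c(TRUE, FALSE, FALSE, FALSE, NA, NA, TRUE, TRUE, FALSE, TRUE),
  score    = c(1.17, 5, 7.2, 6.6, 3, 3.8, 7.2, 9.1, 5.4, 2.21),
  CEO      = c(FALSE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

before <- getOption("na.action")
mm <- model_matrix_keep_na(~ 0 + ., df)
nrow(mm)                                   # all rows are kept
identical(getOption("na.action"), before)  # the global option is unchanged
```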
Session Info
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] knitr_1.33 magrittr_2.0.1 rlang_0.4.11 fastmap_1.1.0
#> [5] fansi_0.5.0 stringr_1.4.0 styler_1.5.1 highr_0.9
#> [9] tools_4.1.1 xfun_0.25 utf8_1.2.2 withr_2.4.2
#> [13] htmltools_0.5.2 ellipsis_0.3.2 yaml_2.2.1 digest_0.6.27
#> [17] tibble_3.1.4 lifecycle_1.0.0 crayon_1.4.1 purrr_0.3.4
#> [21] vctrs_0.3.8 fs_1.5.0 glue_1.4.2 evaluate_0.14
#> [25] rmarkdown_2.10 reprex_2.0.1 stringi_1.7.4 compiler_4.1.1
#> [29] pillar_1.6.2 backports_1.2.1 pkgconfig_2.0.3
Created on 2021-09-19 by the reprex package (v2.0.1)