I want to check if the values in two columns of a dataframe are mismatched and create a new column with this information. I want to use `dplyr::mutate`, and I want to be able to handle `NA` values. A trivial example can be generated with this code:
library(dplyr)
let <- c("a", "b", NA)
LET <- c("A")
perms <- expand.grid(
let_2 = let,
LET_2 = LET,
let_1 = let,
LET_1 = LET,
stringsAsFactors = FALSE
) %>%
.[ncol(.):1]
> perms
LET_1 let_1 LET_2 let_2
1 A a A a
2 A a A b
3 A a A <NA>
4 A b A a
5 A b A b
6 A b A <NA>
7 A <NA> A a
8 A <NA> A b
9 A <NA> A <NA>
I then want to check whether each parameter in group 1 mismatches the same parameter in group 2. This is the desired output:
> good_perms
LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1 A a A a FALSE FALSE
2 A a A b FALSE TRUE
3 A a A <NA> FALSE TRUE
4 A b A a FALSE TRUE
5 A b A b FALSE FALSE
6 A b A <NA> FALSE TRUE
7 A <NA> A a FALSE TRUE
8 A <NA> A b FALSE TRUE
9 A <NA> A <NA> FALSE FALSE
I think the code below should work, but it gives the following output:
good_perms1 <- perms %>%
dplyr::mutate(LET_mismatch = !isTRUE(LET_1 == LET_2)) %>%
dplyr::mutate(let_mismatch = !isTRUE(let_1 == let_2))
> good_perms1
LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1 A a A a TRUE TRUE
2 A a A b TRUE TRUE
3 A a A <NA> TRUE TRUE
4 A b A a TRUE TRUE
5 A b A b TRUE TRUE
6 A b A <NA> TRUE TRUE
7 A <NA> A a TRUE TRUE
8 A <NA> A b TRUE TRUE
9 A <NA> A <NA> TRUE TRUE
This code also fails to give the desired output:
good_perms2 <- perms %>%
dplyr::mutate(LET_mismatch = isFALSE(LET_1 == LET_2)) %>%
dplyr::mutate(let_mismatch = isFALSE(let_1 == let_2))
> good_perms2
LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1 A a A a FALSE FALSE
2 A a A b FALSE FALSE
3 A a A <NA> FALSE FALSE
4 A b A a FALSE FALSE
5 A b A b FALSE FALSE
6 A b A <NA> FALSE FALSE
7 A <NA> A a FALSE FALSE
8 A <NA> A b FALSE FALSE
9 A <NA> A <NA> FALSE FALSE
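One way to narrow the first issue down is to take dplyr out of the picture. The sketch below rebuilds the two `let` columns as plain vectors (copied from the `perms` table above) and calls `isTRUE` and `isFALSE` on the whole comparison at once. It assumes that `mutate` hands each expression the full column rather than one row at a time.

```r
# The let_1 / let_2 columns of perms, written out as plain vectors.
let_1 <- c("a", "a", "a", "b", "b", "b", NA, NA, NA)
let_2 <- c("a", "b", NA, "a", "b", NA, "a", "b", NA)

# `==` is vectorized: one result per row, NA wherever either side is NA.
eq <- let_1 == let_2

# isTRUE() and isFALSE() are not vectorized. Each returns a single value
# and is TRUE only for a length-one, non-NA TRUE (or FALSE) input.
single_true  <- isTRUE(eq)   # FALSE, because eq has length 9
single_false <- isFALSE(eq)  # FALSE, for the same reason

# Negating the single value gives a single TRUE.
not_true <- !single_true
```

If that assumption holds, `!isTRUE(...)` yields a single `TRUE` and `isFALSE(...)` a single `FALSE`, and each is recycled down the whole column, which would match both outputs above.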
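If I use the code below, I get the expected results when both values are defined, but I get `NA` instead of the desired outcome whenever a value is missing. In terms of the equality test (the mismatch column is its negation), the outcome I want is:
- `FALSE` when one of the values is `NA`
- `TRUE` when both of the values are `NA`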
good_perms3 <- perms %>%
dplyr::mutate(LET_mismatch = (LET_1 != LET_2)) %>%
dplyr::mutate(let_mismatch = (let_1 != let_2))
> good_perms3
LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1 A a A a FALSE FALSE
2 A a A b FALSE TRUE
3 A a A <NA> FALSE NA
4 A b A a FALSE TRUE
5 A b A b FALSE FALSE
6 A b A <NA> FALSE NA
7 A <NA> A a FALSE NA
8 A <NA> A b FALSE NA
9 A <NA> A <NA> FALSE NA
I realize that there may be two issues here, but the first one is what I'm most confused about:
- Why does `dplyr::mutate` evaluate `!isTRUE` to `TRUE` for both `!isTRUE("a" == "a")` and `!isTRUE("a" == "b")`? Similarly for `isFALSE`.
- How can I (ideally in one function) identify `NA == "a"` as `FALSE` and `NA == NA` as `TRUE`?
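For the second point, a minimal base-R sketch of an `NA`-aware comparison is shown here. The helper name `is_mismatch` is hypothetical, not an existing dplyr or base function. It treats two `NA`s as a match and exactly one `NA` as a mismatch, which is the behavior described in the desired output.

```r
# Hypothetical helper: vectorized, NA-aware mismatch test.
is_mismatch <- function(x, y) {
  # Exactly one of the two values is missing.
  one_na <- xor(is.na(x), is.na(y))
  # Both values are present and they differ. The is.na() guards come first,
  # so an NA from `x != y` is masked by FALSE rather than propagated.
  differ <- !is.na(x) & !is.na(y) & x != y
  one_na | differ
}

# The let_1 / let_2 and LET_1 / LET_2 columns of perms as plain vectors.
let_1 <- c("a", "a", "a", "b", "b", "b", NA, NA, NA)
let_2 <- c("a", "b", NA, "a", "b", NA, "a", "b", NA)
LET_1 <- rep("A", 9)
LET_2 <- rep("A", 9)

let_mismatch <- is_mismatch(let_1, let_2)
LET_mismatch <- is_mismatch(LET_1, LET_2)
```

Because the helper returns one value per element, it should be usable inside `mutate` as `dplyr::mutate(let_mismatch = is_mismatch(let_1, let_2))`, though that usage is untested here.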
The issue with the `NA`s may need to be addressed separately; my primary concern right now is why `!isTRUE` isn't behaving as expected from within `dplyr::mutate`. Thanks!
P.S. This post touches on this issue, but was solved by different means.