6

I want to check if the values in two columns of a dataframe are mismatched and create a new column with this information. I want to use dplyr::mutate, and I want to be able to handle NA values. A trivial example can be generated with this code:

library(dplyr)
let <- c("a", "b", NA)
LET <- c("A")
perms <- expand.grid(
    let_2 =  let, 
    LET_2 =  LET, 
    let_1  =  let, 
    LET_1  =  LET, 
    stringsAsFactors = FALSE
) %>% 
    .[ncol(.):1]

> perms
  LET_1 let_1 LET_2 let_2
1     A     a     A     a
2     A     a     A     b
3     A     a     A  <NA>
4     A     b     A     a
5     A     b     A     b
6     A     b     A  <NA>
7     A  <NA>     A     a
8     A  <NA>     A     b
9     A  <NA>     A  <NA>

I then want to check if the parameters in group 1 mismatch the same parameter in group 2. This is the desired output:

> good_perms
  LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1     A     a     A     a        FALSE        FALSE
2     A     a     A     b        FALSE         TRUE
3     A     a     A  <NA>        FALSE         TRUE
4     A     b     A     a        FALSE         TRUE
5     A     b     A     b        FALSE        FALSE
6     A     b     A  <NA>        FALSE         TRUE
7     A  <NA>     A     a        FALSE         TRUE
8     A  <NA>     A     b        FALSE         TRUE
9     A  <NA>     A  <NA>        FALSE        FALSE

I think the code below should work, but it gives the following output:

good_perms1 <- perms %>% 
    dplyr::mutate(LET_mismatch = !isTRUE(LET_1 == LET_2)) %>% 
    dplyr::mutate(let_mismatch = !isTRUE(let_1 == let_2)) 

> good_perms1
  LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1     A     a     A     a         TRUE         TRUE
2     A     a     A     b         TRUE         TRUE
3     A     a     A  <NA>         TRUE         TRUE
4     A     b     A     a         TRUE         TRUE
5     A     b     A     b         TRUE         TRUE
6     A     b     A  <NA>         TRUE         TRUE
7     A  <NA>     A     a         TRUE         TRUE
8     A  <NA>     A     b         TRUE         TRUE
9     A  <NA>     A  <NA>         TRUE         TRUE

This code also fails to give the desired output:

good_perms2 <- perms %>% 
    dplyr::mutate(LET_mismatch = isFALSE(LET_1 == LET_2)) %>% 
    dplyr::mutate(let_mismatch = isFALSE(let_1 == let_2)) 

> good_perms2
  LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1     A     a     A     a        FALSE        FALSE
2     A     a     A     b        FALSE        FALSE
3     A     a     A  <NA>        FALSE        FALSE
4     A     b     A     a        FALSE        FALSE
5     A     b     A     b        FALSE        FALSE
6     A     b     A  <NA>        FALSE        FALSE
7     A  <NA>     A     a        FALSE        FALSE
8     A  <NA>     A     b        FALSE        FALSE
9     A  <NA>     A  <NA>        FALSE        FALSE

If I use the code below, the I get the expected results when the values are defined, but I get NA instead of the desired outcome:

  • FALSE when one of the values is NA
  • TRUE when both of the values are NA
good_perms2 <- perms %>% 
    dplyr::mutate(LET_mismatch = (LET_1 != LET_2)) %>% 
    dplyr::mutate(let_mismatch = (let_1 != let_2)) 

> good_perms2
  LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
1     A     a     A     a        FALSE        FALSE
2     A     a     A     b        FALSE         TRUE
3     A     a     A  <NA>        FALSE           NA
4     A     b     A     a        FALSE         TRUE
5     A     b     A     b        FALSE        FALSE
6     A     b     A  <NA>        FALSE           NA
7     A  <NA>     A     a        FALSE           NA
8     A  <NA>     A     b        FALSE           NA
9     A  <NA>     A  <NA>        FALSE           NA

I realize that there may be three issues here, but the first one is what I'm most confused about:

  1. Why does dplyr::mutate evaluate !isTRUE to TRUE for both !isTRUE("a" == "a") and !isTRUE("a" == "b")? Similarly for isFALSE.
  2. How can I (ideally in one function) identify NA == "a" as FALSE and NA == NA as TRUE?

The issue with the NAs may need to be addressed separately, my primary concern right now is why !isTRUE isn't behaving as expected from within dplyr::mutate. Thanks!

P.S. This post touches on this issue, but was solved by different means.

Josh
  • 1,210
  • 12
  • 30
  • 3
    Take a look at what `isTRUE(perms$let_1 == perms$let_2)` gives first. `isTRUE` is not vectorised by the looks of it. From `?isTRUE` "*‘isTRUE(x)’ is an abbreviation of ‘identical(TRUE, x)’, and so is true if and only if ‘x’ is a length-one logical vector*" – thelatemail Aug 27 '19 at 00:21
  • I can't replicate your 1. query either - `!isTRUE("a" == "a")` gives `FALSE` while `!isTRUE("a" == "b")` gives `TRUE` – thelatemail Aug 27 '19 at 00:26
  • 1
    @thelatemail, to your second point, I should have said "from within `dplyr::mutate`". To your first point, I had thought `mutate` evaluates `!isTRUE(perms$let_1[1] == perms$let_2[1]` and then evaluates `!isTRUE(perms$let_1[2] == perms$let_2[1]`, but you've made me realize that `mutate` evaulates `!isTRUE(perms$let_1 == perms$let_2`. I'll need to write a vectorized version of `isTRUE` that deals with my specific criteria for `NA`. Thanks. – Josh Aug 27 '19 at 01:58

3 Answers3

3

Add rowwise()

good_perms1 <- perms %>% rowwise() %>%
    dplyr::mutate(LET_mismatch = !isTRUE(LET_1 == LET_2)) %>% 
    dplyr::mutate(let_mismatch = !isTRUE(let_1 == let_2)) 
Nivel
  • 629
  • 4
  • 12
2

You encountered this issue because isTRUE and isFALSE are not vectorized functions. Accordingly to ?isTRUE:

isTRUE(x) is the same as { is.logical(x) && length(x) == 1 && !is.na(x) && x }; isFALSE() is defined analogously. Consequently, if(isTRUE(cond)) may be preferable to if(cond) because of NAs.

Knowing what I just showed above, let's see your questions.

  1. Why does dplyr::mutate evaluate !isTRUE to TRUE for both !isTRUE("a" == "a") and !isTRUE("a" == "b")? Similarly for isFALSE.
x <- "a" == "a" # TRUE
y <- "a" == "b" # FALSE
!isTRUE(x)
#> [1] FALSE
!isTRUE(y)
#> [1] TRUE
!(is.logical(x) && length(x) == 1 && !is.na(x) && x)
#> [1] FALSE
!(is.logical(y) && length(y) == 1 && !is.na(y) && y)
#> [1] TRUE

All ok here. Now, let's try the same with non-scalar objects.

let_1 <- c("a", "a", "a", "b", "b", "b", NA, NA, NA)
let_2 <- c("a", "b", NA, "a", "b", NA, "a", "b", NA)
let_1 == let_2
#> [1]  TRUE FALSE    NA FALSE  TRUE    NA    NA    NA    NA
!isTRUE(let_1 == let_2)
#> TRUE
x <- (let_1 == let_2)
!(is.logical(x) && length(x) == 1 && !is.na(x) && x)
#> TRUE

As you can see, isTRUE returns a length 1 object. When using mutate, the function recycles the value to all elements, that's why all of let_mismatch is equal to TRUE.

  1. How can I (ideally in one function) identify NA == "a" as FALSE and NA == NA as TRUE?

@www already showed a way to go around this. A better way is using a vectorized if-else, like dplyr::case_when().

You can do this using a function, but that's not the best way (example below). You can also create a function that process expressions, if you want to assign the logical expression directly (also not the best way for your problem).

library(dplyr)

foo <- function(x, y) {
    case_when(
        is.na(x) & !is.na(y) ~ FALSE,
        is.na(x) & is.na(y) ~ TRUE)
}

foo(NA, "a")
#> [1] FALSE
foo(NA, NA)
#> [1] TRUE

Here is a case_when() solution using the example that you showed.

library(dplyr)

good_perms1 <- perms %>% 
    mutate(LET_mismatch = case_when(
        LET_1 == LET_2 ~ FALSE,
        is.na(LET_1) & is.na(LET_2) ~ FALSE,
        TRUE ~ TRUE),
    let_mismatch = case_when(
        let_1 == let_2 ~ FALSE,
        is.na(let_1) & is.na(let_2) ~ FALSE,
        TRUE ~ TRUE))

good_perms1
#>   LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
#> 1     A     a     A     a        FALSE        FALSE
#> 2     A     a     A     b        FALSE         TRUE
#> 3     A     a     A  <NA>        FALSE         TRUE
#> 4     A     b     A     a        FALSE         TRUE
#> 5     A     b     A     b        FALSE        FALSE
#> 6     A     b     A  <NA>        FALSE         TRUE
#> 7     A  <NA>     A     a        FALSE         TRUE
#> 8     A  <NA>     A     b        FALSE         TRUE
#> 9     A  <NA>     A  <NA>        FALSE        FALSE

1

Perhaps replace NA with character "NA", run your code, and then replace character "NA" back with NA.

library(dplyr)

good_perms2 <- perms %>% 
  mutate_all(list(~replace(., is.na(.), "NA"))) %>%
  mutate(LET_mismatch = (LET_1 != LET_2)) %>% 
  mutate(let_mismatch = (let_1 != let_2)) %>%
  mutate_all(list(~replace(., . %in% "NA", NA_character_)))
good_perms2 
#   LET_1 let_1 LET_2 let_2 LET_mismatch let_mismatch
# 1     A     a     A     a        FALSE        FALSE
# 2     A     a     A     b        FALSE         TRUE
# 3     A     a     A  <NA>        FALSE         TRUE
# 4     A     b     A     a        FALSE         TRUE
# 5     A     b     A     b        FALSE        FALSE
# 6     A     b     A  <NA>        FALSE         TRUE
# 7     A  <NA>     A     a        FALSE         TRUE
# 8     A  <NA>     A     b        FALSE         TRUE
# 9     A  <NA>     A  <NA>        FALSE        FALSE
www
  • 38,575
  • 12
  • 48
  • 84
  • Converting `NA` to `"NA"` is going to be way easier than the complicated `is.na` function I was going to write. Thanks! – Josh Aug 27 '19 at 02:01
  • `mutate_all(~replace(., is.na(.), "NA"))` seems to work. Is there a use case where `mutate_all(list(~replace(., is.na(.), "NA")))` is necessary? Also, I'd never seen `NA_character_` before. That will be useful. Thanks! – Josh Aug 27 '19 at 02:08
  • 1
    You can create a function `f1 <- function(x, y) replace(x, is.na(x), "999") != replace(y, is.na(y), "999"); perms %>% mutate(LET_mismatch = f1(LET_1, LET_2), let_mismatch = f1(let_1, let_2))` – akrun Aug 27 '19 at 03:22