How do I write a function that uses dplyr to check that a variable has no NA values?

Question

I have simple code that creates arbitrary example data:

library(assertr)
library(tidyverse)
set.seed(1)
df <- tibble(id = 1:10, value = rnorm(10, 0, 1)) %>%
  mutate(value = if_else(abs(value) < 0.5, NA_real_, value))

The data looks like this:

> df
# A tibble: 10 x 2
      id   value
   <int>   <dbl>
 1     1  -0.626
 2     2  NA    
 3     3  -0.836
 4     4   1.60 
 5     5  NA    
 6     6  -0.820
 7     7  NA    
 8     8   0.738
 9     9   0.576
10    10  NA

Now, I'm trying to write a function that checks if any rows in a given column (in this case, the value column) have NA values and throws an error if they do. If they don't, it should return the original data, unmodified, so that the pipe can continue. This is simple without a function:

df %>% verify(sum(is.na(value)) == 0)

# Outputs "Error: assertr stopped execution"

Wrapping this in a function causes difficulty, however. I tried using lazyeval:

verify_not_missing <- function(.data, v) {
  .data %>% verify(sum(is.na(lazyeval::lazy(v))) == 0)
}
df %>% verify_not_missing(value)

But this doesn't throw any error or stop execution. It silently continues execution. Similarly, from the dplyr programming vignette, I thought the following would work:

verify_not_missing <- function(.data, v) {
  .data %>% verify(sum(is.na(!! quo(v))) == 0)
}
df %>% verify_not_missing(value)

but that throws an error:

Error in is_quosure(e2) : argument "e2" is missing, with no default

I searched through some of the documentation and SO, including this question, but some of the answers mention deprecated parts of dplyr that aren't much help (case in point, calling vignette("nse") reveals that the vignette no longer exists).

What am I missing here?

I'm using R v3.5.1, dplyr v0.7.7, and assertr v2.5 on an x64 Linux system.

@MikeH. I still get the same error if I switch out `quo` for `enquo` — Michael A, Dec 17 '18 at 03:33
I've tried a million things but can't solve it yet. But first, you can simplify the non-function to ```df %>% verify(is.na(value))``` right? — twedl, Dec 17 '18 at 03:41
It looks like `assertr` doesn't play very nicely with `dplyr`. You might need to do a non-standard evaluation without using dplyr's nice syntax — Mike H., Dec 17 '18 at 03:46
[Note that the development version of assertr implements ```rlang::eval_tidy``` internally, which may affect an answer for this.] — twedl, Dec 17 '18 at 04:12

Taher A. Ghaleb · Answer 1 · 2018-12-18T19:59:09.717

There are three possible ways to achieve this:

First approach

Using eval() with substitute(), like this:

verify_not_missing <- function(.data, v) {
  v <- eval(substitute(v), .data)
  .data %>% 
    verify(sum(is.na(v)) == 0)
}
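
To see what the first line of this function does, here is a minimal sketch in base R only. `column_values` and `toy` are hypothetical names used for illustration; they are not part of assertr or dplyr. `substitute(v)` captures the unevaluated argument, and `eval()` then looks it up among the columns of the data frame:

```r
# Hypothetical helper: returns `v` evaluated inside `.data`.
column_values <- function(.data, v) {
  # substitute(v) captures the expression the caller typed (e.g. the symbol `value`);
  # eval() resolves it against the columns of .data, falling back to the caller's scope.
  eval(substitute(v), .data, parent.frame())
}

toy <- data.frame(id = 1:3, value = c(0.5, NA, 2))

column_values(toy, value)               # the `value` column as a plain vector
sum(is.na(column_values(toy, value)))   # number of missing entries in that column
```

Once `v` is an ordinary vector, `sum(is.na(v)) == 0` is a plain computation with no column lookup left to do.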

Second approach

Using rlang::eval_tidy() with enquo(), like this:

verify_not_missing <- function(.data, v) {
  v <- rlang::eval_tidy(enquo(v), .data)
  .data %>% 
    verify(sum(is.na(v)) == 0)
}
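
As a rough illustration of how this differs from the eval()/substitute() version, the sketch below assumes only that rlang is installed. `column_values_tidy`, `rescaled`, and `toy` are hypothetical names. `enquo()` captures the expression together with the environment it was written in, so variables local to the caller are still found:

```r
library(rlang)

# Hypothetical helper: tidy-eval counterpart of eval(substitute(v), .data).
column_values_tidy <- function(.data, v) {
  # enquo() returns a quosure: the expression plus its original environment.
  eval_tidy(enquo(v), data = .data)
}

toy <- data.frame(id = 1:3, value = c(0.5, NA, 2))

# `multiplier` exists only inside rescaled(); the quosure remembers that
# environment, so the expression still evaluates correctly.
rescaled <- function() {
  multiplier <- 10
  column_values_tidy(toy, value * multiplier)
}

rescaled()  # the value column multiplied by 10, with the NA preserved
```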

Third approach

Using !!enquo() inside select() (you would need colnames(.data) to get the other columns)

verify_not_missing <- function(.data, v) {
  .data %>% 
    select(colnames(.data), v = !!enquo(v)) %>%
    verify(sum(is.na(v)) == 0)
}

df %>% verify_not_missing(value)

All of them produce the same result, which, using your data, looks like the following:

#verification [sum(is.na(v)) == 0] failed! (1 failure)

#    verb  redux_fn           predicate  column  index  value
#1 verify        NA  sum(is.na(v)) == 0      NA      1     NA

#Error: assertr stopped execution

Hope it helps.

This seems to work, but is there a way to do this that won't break the pipe if `verify_not_missing` doesn't find missing values? Adding `return(.data)` to the function works, but I figured I'd ask in case there's a better way. — Michael A, Dec 17 '18 at 16:53
Try adding `%>% mutate(new_var = 2)` after calling `df %>% verify_not_missing(value)`. Your code drops the `id` variable and all other variables in the data frame except `value`, which means that trying to use your version of `verify_not_missing` in a pipe doesn't work. It breaks the pipe because it changes the data irreversibly. — Michael A, Dec 17 '18 at 23:52

younggeun · Accepted Answer · 2018-12-19T14:31:44.310

If you don't have to use the assertr package, consider this solution.

library(tidyverse)

verify_not_missing <- function(.data) {
  col_na <- colSums(is.na(.data)) > 0 # TRUE for each column that contains at least one NA
  if (any(col_na)) stop(gettextf("column %s is missing", 
                                 str_c(names(col_na)[col_na], collapse = ", ")))
}

colSums(is.na(.)) counts the NA values in each column, so any count above zero marks a column with missing data. When such a column exists, its name can go straight into the error message.

For the case where several columns contain NA, I collapse their names() into one string.

Applied to your dataset, this gives:
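
A small base R sketch of the intermediate values may make this clearer. The data frame `toy` is made up for illustration:

```r
toy <- data.frame(
  id = 1:4,
  x  = c(1, NA, 3, NA),
  y  = c("a", "b", "c", "d")
)

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs
# per column and keeps the column names.
na_counts <- colSums(is.na(toy))
na_counts                 # id = 0, x = 2, y = 0

# Columns with at least one NA, collapsed into a single string for a message.
bad_cols <- names(na_counts)[na_counts > 0]
paste(bad_cols, collapse = ", ")
```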

df %>% 
  verify_not_missing()
#> Error in verify_not_missing(.): column value is missing

Similarly, when more than one column contains NA values:

(mydf2 <- tibble(id = 1:10, value = rnorm(10, 0, 1)) %>%
  mutate(value1 = if_else(abs(value) < 0.5, NA_real_, value),
         value2 = if_else(abs(value) < 0.5, NA_real_, value)))
#> # A tibble: 10 x 4
#>       id   value  value1  value2
#>    <int>   <dbl>   <dbl>   <dbl>
#>  1     1  1.51     1.51    1.51 
#>  2     2  0.390   NA      NA    
#>  3     3 -0.621   -0.621  -0.621
#>  4     4 -2.21    -2.21   -2.21 
#>  5     5  1.12     1.12    1.12 
#>  6     6 -0.0449  NA      NA    
#>  7     7 -0.0162  NA      NA    
#>  8     8  0.944    0.944   0.944
#>  9     9  0.821    0.821   0.821
#> 10    10  0.594    0.594   0.594

mydf2 %>% 
  verify_not_missing()
#> Error in verify_not_missing(.): column value1, value2 is missing

The error message lists value1 and value2, the columns that contain NA.

Edit - Adding column argument

You can capture the argument with enquo(v) and then use %>% select(!!v), which returns the columns named by v. The remaining parts are the same.

verify_not_missing2 <- function(.data, v) {
  v <- enquo(v)
  col_na <-
    .data %>% 
    select(!!v) %>% # this returns v columns
    is.na() %>%
    colSums()
  col_na <- col_na > 0
  if (any(col_na)) stop(gettextf("column %s is missing", 
                                 str_c(names(col_na)[col_na], collapse = ", ")))
}

Applying this to the example,

df %>% 
  verify_not_missing2(value)
#> Error in verify_not_missing2(., value): column value is missing

With value as the argument, the function raises the error. For the data with several NA columns:

mydf2 %>% 
  verify_not_missing2(value)
#---------------------------
mydf2 %>% 
  verify_not_missing2(value1)
#> Error in verify_not_missing2(., value1): column value1 is missing

When the column you pass contains no NA (here value, rather than value1 or value2), nothing is printed. Passing value1, on the other hand, raises the error.

Also, you can specify multiple columns with c().

mydf2 %>% 
  verify_not_missing2(v = c("value1", "value2"))
#> Error in verify_not_missing2(., v = c("value1", "value2")): column value1, value2 is missing
#----------------------------
mydf2 %>% 
  verify_not_missing2(v = c(value1, value2))
#> Error in verify_not_missing2(., v = c(value1, value2)): column value1, value2 is missing

Edit2 - Returning original Data

verify_not_missing3 <- function(.data, v) {
  v <- enquo(v)
  col_na <-
    .data %>% 
    select(!!v) %>% 
    is.na() %>% 
    colSums()
  col_na <- col_na > 0
  if (any(col_na)) {
    stop(gettextf("column %s is missing", 
                                 str_c(names(col_na)[col_na], collapse = ", ")))
  } else {
    .data
  }
}

The added else { .data } branch returns the data when no error is raised.

If you pass value,

mydf2 %>% 
  verify_not_missing3(value)
#> # A tibble: 10 x 4
#>       id   value  value1  value2
#>    <int>   <dbl>   <dbl>   <dbl>
#>  1     1  1.51     1.51    1.51 
#>  2     2  0.390   NA      NA    
#>  3     3 -0.621   -0.621  -0.621
#>  4     4 -2.21    -2.21   -2.21 
#>  5     5  1.12     1.12    1.12 
#>  6     6 -0.0449  NA      NA    
#>  7     7 -0.0162  NA      NA    
#>  8     8  0.944    0.944   0.944
#>  9     9  0.821    0.821   0.821
#> 10    10  0.594    0.594   0.594

On the other hand,

mydf2 %>% 
  verify_not_missing3(value1)
#> Error in verify_not_missing3(., value1): column value1 is missing
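
For readers on newer package versions than the dplyr v0.7.7 used in the question, the same idea can be sketched with the `{{ }}` operator, which replaces the enquo()/!! pair. This assumes rlang 0.4.0 or later and a dplyr version that supports `{{ }}`. The name `verify_not_missing4` and the message wording are hypothetical:

```r
library(dplyr)

# Hypothetical variant of verify_not_missing3 using {{ }} instead of enquo() and !!.
verify_not_missing4 <- function(.data, v) {
  # Count NA values in the selected column(s) only.
  na_counts <- .data %>%
    select({{ v }}) %>%
    is.na() %>%
    colSums()
  has_na <- na_counts > 0

  if (any(has_na)) {
    stop("column(s) with missing values: ",
         paste(names(has_na)[has_na], collapse = ", "),
         call. = FALSE)
  }

  # No NA found: hand the untouched data back so the pipe can continue.
  .data
}

toy <- tibble(id = 1:3, a = c(1, NA, 3), b = c(1, 2, 3))

toy %>% verify_not_missing4(b) %>% mutate(new_var = 2)  # passes, pipe continues
# toy %>% verify_not_missing4(a)                        # would stop with an error
```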

Thanks for this. I don't need (and in fact, don't want) the code to err out if *any* column in the tibble has `NA` values; only if certain columns do, hence why my original attempt at a function accepted a column name as an argument. — Michael A, Dec 19 '18 at 00:51
These break the pipe though, right? Since they don't return the original data if no errors were found? — Michael A, Dec 19 '18 at 14:19
I didn't know that you wanted to return the original data. Sorry about that; I keep getting it wrong. Since it has an `if()` statement, it's easy to add that feature. I'll edit. — younggeun, Dec 19 '18 at 14:22

score 0 · Answer 3 · answered Dec 19 '18 at 14:29

Here is how you could do something similar in base R:

verify_not_missing <- function(.data, v) {
  !any(
    is.na(
      .data[[deparse(substitute(v))]]
    )
  )
} 

verify_not_missing(df, value)
[1] FALSE
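
The function above returns TRUE or FALSE rather than the data, so it cannot sit in the middle of a pipe. A minimal base R sketch that stops on missing values and otherwise returns its input might look like the following. The name `stop_if_missing` and the error messages are hypothetical:

```r
# Hypothetical pipe-friendly variant: errors on NA, otherwise returns .data unchanged.
stop_if_missing <- function(.data, v) {
  # Turn the bare column name into a string, as in the function above.
  col <- deparse(substitute(v))

  if (!col %in% names(.data)) {
    stop("no column named ", col, call. = FALSE)
  }
  if (anyNA(.data[[col]])) {
    stop("column ", col, " has missing values", call. = FALSE)
  }

  .data
}

toy <- data.frame(id = 1:3, value = c(0.5, NA, 2))

stop_if_missing(toy, id)       # returns toy unchanged
# stop_if_missing(toy, value)  # would stop with an error
```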

s_baldur

