I have already looked on SO for an answer to this question, but didn't manage to find a solution to my problem.

I have a dataframe with several columns, each of which has at least one NA. Names of these columns are stored in character vector vars_na. For each of those, I would like to create a dummy variable taking value 0 if the value for that observation is missing, and 1 otherwise.

Below is a reproducible toy example, together with the code I have used so far:

library(dplyr)

# creation of toy dataset
iris[1:5, 1] <- rep(NA, 5)
iris[1:10, 4] <- rep(NA, 10)
vars_na <- c("Sepal.Length", "Petal.Width")

for(var in vars_na){
  iris <- iris %>% 
    mutate(dummy = ifelse(is.na(!!var), 0, 1)) %>% 
    rename_at(c("dummy"), list(~paste0("dummyna_", var)))
  # 'rename_at' is just to differentiate between the several dummies created,
  # and it works correctly
}

The problem is that the newly created dummies are vectors full of 1s, so missing values are not detected correctly; indeed:

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species dummyna_Sepal.Length dummyna_Petal.Width
1           NA         3.5          1.4          NA  setosa                    1                   1
2           NA         3.0          1.4          NA  setosa                    1                   1
3           NA         3.2          1.3          NA  setosa                    1                   1
4           NA         3.1          1.5          NA  setosa                    1                   1
5           NA         3.6          1.4          NA  setosa                    1                   1
6          5.4         3.9          1.7          NA  setosa                    1                   1

but I would like to obtain

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species dummyna_Sepal.Length dummyna_Petal.Width
1           NA         3.5          1.4          NA  setosa                    0                   0
2           NA         3.0          1.4          NA  setosa                    0                   0
3           NA         3.2          1.3          NA  setosa                    0                   0
4           NA         3.1          1.5          NA  setosa                    0                   0
5           NA         3.6          1.4          NA  setosa                    0                   0
6          5.4         3.9          1.7          NA  setosa                    1                   0

The code is simple and I believed it should work. What am I doing wrong? Thanks in advance.

Ric S

1 Answer

The problem is that var is a character string, so after unquoting, is.na(!!var) becomes is.na("Sepal.Length"), which is always FALSE. ifelse() therefore returns 1 for every row.
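
To make this concrete, here is a minimal base R sketch of what the expression reduces to once the string has been unquoted. The small data frame df and the names flag and dummy are made up for illustration; they are not part of the question's code.

```r
var <- "Sepal.Length"

# After unquoting, the expression refers to the string itself, not to the column.
is.na(var)                        # a non-missing string of length 1

# The condition is a single FALSE, so ifelse() yields a single 1,
# which mutate() then recycles to every row.
flag <- ifelse(is.na(var), 0, 1)

# Looking the column up by name reaches the actual values instead.
df <- data.frame(Sepal.Length = c(NA, 5.4, NA))   # hypothetical toy data
dummy <- as.integer(!is.na(df[[var]]))
```

Here flag is a single 1, while dummy follows the missing values of the column row by row.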

You can use rlang::sym* to turn a character string into a symbol that mutate can evaluate as a column name, for example:

for (var in vars_na) {
  var_sym <- rlang::sym(var)
  new_name <- rlang::sym(paste0(var, "_na"))

  iris <- iris %>%
    mutate(!!new_name := as.integer(!is.na(!!var_sym)))
}

*The rlang package serves as the basis for most of the non-standard evaluation that dplyr supports; see tidy evaluation.
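
As a possible alternative to building symbols by hand, the sketch below shows two other routes that I believe give the same dummies: the .data pronoun inside the loop, and across() without a loop. It assumes dplyr 1.0.0 or later (for across() and its .names argument); the object names df, looped and result are illustrative only, and the dummyna_ prefix is taken from the question rather than from the answer's code.

```r
library(dplyr)

# Toy data as in the question, built on a copy of iris
df <- iris
df[1:5, "Sepal.Length"] <- NA
df[1:10, "Petal.Width"] <- NA
vars_na <- c("Sepal.Length", "Petal.Width")

# Option 1: keep the loop, but look the column up by name via .data
looped <- df
for (var in vars_na) {
  looped <- looped %>%
    mutate(!!paste0("dummyna_", var) := as.integer(!is.na(.data[[var]])))
}

# Option 2: no loop; across() applies the function to every selected column
# and .names controls how the new columns are called
result <- df %>%
  mutate(across(all_of(vars_na),
                ~ as.integer(!is.na(.x)),
                .names = "dummyna_{.col}"))
```

Both versions avoid converting strings to symbols explicitly; which one reads better is mostly a matter of taste.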

Alexis
  • Puh, sometimes I see `quote`, `quo`, or `sym`. When to use what? I did not manage to get it done with `quo` ... – DSGym Jun 22 '19 at 16:15
  • @DSGym it's not easy to make sense of it all; it took me a lot of trial and error. For those specifically: `quote` is base R, `quo` is similar but from `rlang` and also captures environments, `sym` is exclusively for transforming a character string into a symbol (whereas `quote`/`quo` could capture a character too). Definitely give that tidy evaluation link a read, it's not too long. – Alexis Jun 22 '19 at 16:19
  • @Alexis thank you very much for your suggestion. I thought I had a decent understanding of these concepts, but clearly I have to study them more. Thanks again! – Ric S Jun 22 '19 at 23:40
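
The distinction between `quote`, `quo` and `sym` drawn in the comments above can be checked with a short sketch. It assumes the rlang package is installed; the variable names q_base, q_rlang and s are made up for illustration.

```r
library(rlang)

var <- "Sepal.Length"

# quote() is base R: it captures the expression as written, here the symbol `var`
q_base <- quote(var)

# quo() also captures the expression `var`, bundled with its environment (a quosure)
q_rlang <- quo(var)

# sym() uses the *value* of var and builds the symbol `Sepal.Length` from it
s <- sym(var)

# quote() applied to a string literal just returns the string
lit <- quote("Sepal.Length")
```

Only sym() turns the contents of the string into a column name, which is why it is the one that works inside the loop.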