
I am trying to use dplyr::case_when within dplyr::mutate to create a new variable where I set some values to missing and recode other values simultaneously.

However, when I try to set values to NA, I get an error indicating that the variable new cannot be created because NA is of type logical:

Error in mutate_impl(.data, dots) :
Evaluation error: must be type double, not logical.

Is there a way to set values to NA in a non-logical vector in a data frame using this?

library(dplyr)    

# Create data
df <- data.frame(old = 1:3)

# Create new variable
df <- df %>% dplyr::mutate(new = dplyr::case_when(old == 1 ~ 5,
                                                  old == 2 ~ NA,
                                                  TRUE ~ old))

# Desired output
c(5, NA, 3)
Scarabee
socialscientist

2 Answers

82

As stated in ?case_when:

All RHSs must evaluate to the same type of vector.

You actually have two possibilities:

1) Create new as a numeric vector

df <- df %>% mutate(new = case_when(old == 1 ~ 5,
                                    old == 2 ~ NA_real_,
                                    TRUE ~ as.numeric(old)))

Note that NA_real_ is the numeric (double) version of NA, and that you must convert old to numeric because 1:3 created it as an integer column in your original data frame.

You get:

str(df)
# 'data.frame': 3 obs. of  2 variables:
# $ old: int  1 2 3
# $ new: num  5 NA 3
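To see why the original call failed while this one works, it can help to inspect the type of each right-hand side with base R's typeof(). This is only a small sketch; the names rhs_question and rhs_fixed are made up for illustration and do not come from the question.

```r
# Types of the three right-hand sides in the question's case_when() call
rhs_question <- c(
  five = typeof(5),    # numeric literal
  na   = typeof(NA),   # plain NA
  old  = typeof(1:3)   # the `old` column, created with 1:3
)

# Types of the right-hand sides after the changes in option 1
rhs_fixed <- c(
  five = typeof(5),
  na   = typeof(NA_real_),
  old  = typeof(as.numeric(1:3))
)

print(rhs_question)
print(rhs_fixed)
```

The first vector mixes three different types, while the second contains a single type, which is what the quoted documentation asks for.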

2) Create new as an integer vector

df <- df %>% mutate(new = case_when(old == 1 ~ 5L,
                                    old == 2 ~ NA_integer_,
                                    TRUE ~ old))

Here, 5L forces 5 into the integer type, and NA_integer_ is the integer version of NA.

So this time new is integer:

str(df)
# 'data.frame': 3 obs. of  2 variables:
# $ old: int  1 2 3
# $ new: int  5 NA 3
Scarabee
  • You can also do `as.numeric(NA)` or `as.integer(NA)` for the `NA` cases, as `NA_real_` and `NA_integer_` are a bit annoying to remember and rarely used outside of things like this. – Marius Jul 03 '17 at 23:43
  • Nice. Also, to show: identical(NA_real_,as.numeric(NA)) produces TRUE. – socialscientist Jul 04 '17 at 04:36
  • @hadley This answer is now clear to me, but it took me a while to figure out. It would be very helpful to have an example of this in the tidyverse `case_when` documentation. In my case, when all values were missing for grouped data, mean(x[1:2], na.rm=T) was generating a NaN result. Recoding those cases to NA_real_ fixed it. – Matt L. Dec 19 '17 at 22:54
  • @MattL. Hadley won't be pinged, see [here](https://meta.stackexchange.com/questions/43019/how-do-comment-replies-work) for more details on how notifications work. However, the documentation was clarified last week by this PR: https://github.com/tidyverse/dplyr/pull/3247 – Scarabee Dec 19 '17 at 23:49
  • When inserting `NA` placeholders into a column of class `Date`, use `as.Date(NA)` to produce the `NA`. – seasmith May 10 '18 at 17:09
  • @Marius's solution also works with as.character. Simple and versatile – Jake Jul 21 '21 at 01:45
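The typed-NA alternatives mentioned in the comments above can be checked in base R. The following is a minimal sketch; the variable names are illustrative only.

```r
# Each typed NA constant is identical to a plain NA coerced to that type
same_real      <- identical(NA_real_, as.numeric(NA))
same_integer   <- identical(NA_integer_, as.integer(NA))
same_character <- identical(NA_character_, as.character(NA))

# Base R has no NA constant for Date; as.Date(NA) gives a missing Date value
na_date <- as.Date(NA)

print(c(same_real, same_integer, same_character))
print(class(na_date))
print(is.na(na_date))
```

Either spelling should behave the same on the right-hand side of a case_when() formula, since both produce an NA of the required type.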
5

Try this:

df <- df %>% dplyr::mutate(new = dplyr::case_when(.$old == 1 ~ 5,
                                                  .$old == 2 ~ NA_real_,
                                                  TRUE ~ as.numeric(.$old)))

> df
  old new
1   1   5
2   2  NA
3   3   3
BENY
  • Does `.$old` make a difference? Doesn't this return the same as `old`, as we're dealing with the full data frame? – pluke Aug 10 '21 at 22:10
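One way to look into the question raised in the last comment is to run both spellings on the question's ungrouped data frame and compare the results. This is a sketch under the assumption that dplyr is installed; with_dot and without_dot are hypothetical names, and the comparison says nothing about grouped data, where `.` still refers to the whole data frame passed through the pipe.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(old = 1:3)

# Version from this answer: columns accessed through the pipe placeholder
with_dot <- df %>%
  mutate(new = case_when(.$old == 1 ~ 5,
                         .$old == 2 ~ NA_real_,
                         TRUE ~ as.numeric(.$old)))

# Version from the accepted answer: bare column names
without_dot <- df %>%
  mutate(new = case_when(old == 1 ~ 5,
                         old == 2 ~ NA_real_,
                         TRUE ~ as.numeric(old)))

# Compare the two results on this ungrouped data frame
print(identical(with_dot, without_dot))
```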