Using dplyr to conditionally replace values in a column

Question

I have an example data set with a column that reads somewhat like this:

Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee

What I'd like to do is collapse it into just two factor levels - "Candy" and "Non-Candy". I can do this with Python/Pandas, but can't seem to figure out a dplyr-based solution. Thank you!

leerssej · Answer 1 · 2019-06-18T23:00:35.600

89

With dplyr's mutate() and base R's replace():

dat %>% 
    mutate(var = replace(var, var != "Candy", "Not Candy"))

This is significantly faster than the ifelse approaches. The initial data frame can be created as follows:

library(dplyr)
# tibble() is re-exported by dplyr; as_data_frame() is deprecated
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))
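
As a quick check of what replace() does here, a minimal sketch on the example data (the name `result` is only for illustration). replace() returns a copy of the vector with the elements selected by the logical condition swapped out, so the column stays character; since the question asks for a factor, the sketch wraps the call in factor().

```r
library(dplyr)

dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# replace() swaps every element where the condition is TRUE;
# factor() then turns the two remaining strings into levels.
result <- dat %>%
    mutate(var = factor(replace(var, var != "Candy", "Not Candy")))

levels(result$var)  # "Candy" "Not Candy"
table(result$var)   # 4 x Candy, 6 x Not Candy
```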

edited Jun 18 '19 at 23:00

answered Mar 24 '17 at 04:56

leerssej



Isn't there a function that doesn't require repeating `var`? – Julien Sep 18 '22 at 06:39
The only function I know of that fills in the condition automatically is [replace_na()](https://tidyr.tidyverse.org/reference/replace_na.html). Explanation: the first `var` names the output column, the second `var` is the input column, and the third `var` is the column used in the condition you are applying. A function that omitted one of these would have to assume (hard-code) one or more of the three [input, output, conditional column]. – leerssej Sep 20 '22 at 21:20

score 20 · Answer 2 · answered May 04 '20 at 22:18

Another solution with dplyr using case_when:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           TRUE ~ 'Non-Candy'))

The syntax for case_when is condition ~ replacement value. Conditions are evaluated in order and the first match wins, so the final TRUE acts as a catch-all. See ?case_when for the documentation.

This is probably less efficient than the solution using replace, but it has the advantage that multiple replacements can be performed in a single, still readable command, for example to produce three levels:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           var == 'Water' ~ 'Water',
                           TRUE ~ 'Neither-Water-Nor-Candy'))

eipi10 · Accepted Answer · 2016-02-24T19:01:14.947

17

Assuming your data frame is dat and your column is var:

dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))

edited Feb 24 '16 at 19:01

answered Feb 24 '16 at 18:48

eipi10


@RichardScriven's approach (comments on mine) strictly dominates this – MichaelChirico Feb 24 '16 at 20:53

MichaelChirico · Answer 4 · 2022-08-17T04:13:53.853

No need for dplyr. Assuming var is stored as a factor already:

non_c <- setdiff(levels(dat$var), "Candy")
    
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

See ?levels.
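
To see what the list assignment does, here is a small sketch on a hypothetical five-element factor `f` (not the asker's data). Each list element names a new level and lists the old levels that are merged into it.

```r
# Hypothetical small factor standing in for dat$var
f <- factor(c("Candy", "Water", "Cake", "Candy", "Gum"))

# Every level except "Candy" is merged into "Non-Candy"
non_c <- setdiff(levels(f), "Candy")
levels(f) <- list(Candy = "Candy", "Non-Candy" = non_c)

as.character(f)  # "Candy" "Non-Candy" "Non-Candy" "Candy" "Non-Candy"
nlevels(f)       # 2
```

Only the levels attribute is rewritten; the underlying integer codes are remapped without touching the strings element by element, which is presumably why this approach benchmarks well on long vectors.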

This is much more efficient than the ifelse approach, which is bound to be slow:

library(dplyr)           # for %>% and mutate()
library(microbenchmark)  # for get_nanotime()
set.seed(01239)
# resample data; keep var as a factor so that levels() works
smp <- data.frame(var = factor(sample(dat$var, 1e6, TRUE)))
    
timings <- replicate(50, {
  # copy data to facilitate reuse
  cop <- smp
  t0 <- get_nanotime()
  levs <- setdiff(levels(cop$var), "Candy")
  levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
  t1 <- get_nanotime() - t0

  cop <- smp
  t0 <- get_nanotime()
  cop = cop %>%
    mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
  t2 <- get_nanotime() - t0

  cop <- smp
  t0 <- get_nanotime()
  cop$var <- 
    factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
  t3 <- get_nanotime() - t0
  c(levels = t1, dplyr = t2, direct = t3)
})

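x <- apply(timings, 1, median)
x[2:3]/x[1]
#    dplyr   direct 
# 8.894303 4.962791

That is, resetting the levels is about 9 times faster than the dplyr/ifelse approach and about 5 times faster than calling factor() directly.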

Or also `factor(dat$var == "Candy", labels = c("Non-Candy", "Candy"))` but I think resetting the levels is a nice way to go. — Rich Scriven, Feb 24 '16 at 20:43

score 2 · Answer 5 · answered Apr 28 '22 at 15:15

I didn't benchmark this, but at least in some cases with more than one condition, a combination of mutate and a list seems to provide an easy solution:

# assuming that all sweet things fall in one category

dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

conditions <- list("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE, 
"Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE, "Coffee" = FALSE)

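# unlist() turns the list lookup into a plain logical column
# instead of a list-column
dat %>% mutate(sweet = unlist(conditions[as.character(var)], use.names = FALSE))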

score 1 · Answer 6 · answered Nov 29 '20 at 00:23

When you only need two values, a simple ifelse() is the prettiest, I think.

Furthermore, nested ifelse() calls can reproduce the case_when solution proposed by PhJ (I do like the readability of his version, though)!

dat %>%
    mutate(
        var = ifelse(var == "Candy", "Candy", "Non-Candy")
    )
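
For the nested variant mentioned above, a minimal sketch, assuming the same three target levels as in the case_when answer (the five-row `dat` and the name `nested` are made up for illustration):

```r
library(dplyr)

dat <- tibble(var = c("Candy", "Water", "Cake", "Candy", "Coffee"))

# The inner ifelse() supplies the value for rows where the
# outer test (var == "Candy") is FALSE.
nested <- dat %>%
    mutate(var = ifelse(var == "Candy", "Candy",
                 ifelse(var == "Water", "Water",
                        "Neither-Water-Nor-Candy")))

nested$var
```

Each extra level adds another layer of nesting, which is where case_when tends to stay easier to read.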

score 1 · Answer 7 · answered Feb 27 '23 at 08:10


A newer solution is to use case_match from dplyr

library(dplyr)
dat %>% 
    mutate(var = case_match(var, "Candy" ~ var, .default = "Not Candy"))
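
A small sketch of how case_match() extends to several values per group, assuming dplyr 1.1.0 or later; the labels "Sweet" and "Other" and the column name `group` are made up for illustration. Note that .default is passed as a named argument with `=`, not as a formula.

```r
library(dplyr)  # case_match() needs dplyr >= 1.1.0

dat <- tibble(var = c("Candy", "Sanitizer", "Gum", "Water"))

# Each left-hand side may be a vector of values to match;
# anything unmatched falls through to .default.
sweets <- dat %>%
    mutate(group = case_match(var,
                              c("Candy", "Gum") ~ "Sweet",
                              .default = "Other"))

sweets$group  # "Sweet" "Other" "Sweet" "Other"
```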

answered Feb 27 '23 at 08:10

Julien
