3

I have a dataframe in which I want to check some condition and need to add a new column based on result of condition.

This is my input data

InputData = data.frame(A = c("", "", "Apple"), B = c("", "", "Orange"), C = c("", "", ""), D = c(0, 1, 1))

This is my desired output

OutputData = InputData %>%
  mutate(R = case_when(A=='' & B=='' & C=='' & D==0 ~ "Yes", TRUE ~ "No"))

I have tried mutate with Case function. It is working fine but it takes longer time when I have more number of rows.

Please help me to do it in faster way.

zx8754
  • 52,746
  • 12
  • 114
  • 209
  • Why are you copying the entire data.frame when you just want to add a column? How large is your data.frame? What are the data types / classes of columns A to D? Is `D` really a character column as in the example? Why is it not numeric with `NA` for missing values? – Roland Mar 30 '23 at 07:20
  • My data frame is having more than 1lac rows and Column D is numeric column. This InputData is just a sample of my problem statement – ParthaSarathi Mar 30 '23 at 07:26

2 Answers2

2

I'm surprised that your code is slow with such small data (only 100k rows). I would do it like this:

InputData$R <- "No"
InputData[InputData$A == '' & InputData$B == '' &
            InputData$C == '' & InputData$D == 0, "R"] <- "Yes"

However, I strongly recommend using logical values instead of "Yes"/"No":

InputData$S <- InputData$A == '' & InputData$B == '' &
  InputData$C == '' & InputData$D == 0
#      A      B C D   R     S
#1                0 Yes  TRUE
#2                1  No FALSE
#3 Apple Orange   1  No FALSE

If that is still too slow, package data.table can help. But it shouldn't be necessary unless the data gets actually large.

Roland
  • 127,288
  • 10
  • 191
  • 288
1

A simple benchmark is following. It seems that if you just want to dichotomize a column, if_else is preferable to case_when in dplyr. If you care about the speed, change the workflow to base like @Roland's answer.

InputData = data.frame(A = sample(c('x', ''), 1e5, TRUE),
                       B = sample(c('x', ''), 1e5, TRUE),
                       C = sample(c('x', ''), 1e5, TRUE),
                       D = sample(0:1, 1e5, TRUE))

library(dplyr)

bench::mark(
  "base::ifelse" = InputData %>% mutate(R = ifelse(A == '' & B == '' & C == '' & D == 0, "Yes", "No")),
  "dplyr::case_when" = InputData %>% mutate(R = case_when(A == '' & B == '' & C == '' & D == 0 ~ "Yes", TRUE ~ "No")),
  "dplyr::if_else" = InputData %>% mutate(R = if_else(A == '' & B == '' & C == '' & D == 0, "Yes", "No")),
  "base::repalce" = InputData %>% mutate(R = "No", R = replace(R, A == '' & B == '' & C == '' & D == 0, "Yes")),
  "base::`[<-`.Roland" = local({
    InputData$R <- "No"
    InputData$R[InputData$A == '' & InputData$B == '' & InputData$C == '' & InputData$D == 0] <- "Yes"
    InputData
  }),
  iterations = 100
)

# # A tibble: 5 × 9
#   expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 base::ifelse        24.87ms  25.82ms      38.0    7.63MB     17.1    69    31      1.82s
# 2 dplyr::case_when    15.65ms  16.91ms      57.0     8.4MB     24.4    70    30      1.23s
# 3 dplyr::if_else       6.77ms   7.17ms     133.     6.87MB     39.6    77    23   580.57ms
# 4 base::repalce         5.6ms    5.9ms     166.     5.75MB     36.4    82    18    495.1ms
# 5 base::`[<-`.Roland   3.47ms   3.52ms     269.     3.84MB     33.2    89    11   331.35ms
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51