Say I have:

df <- data.frame(
  x = c(TRUE, FALSE, FALSE, TRUE),
  y = c(100, 100, 140, 180)
)

Then I want to have z with the value of y if x is TRUE and 0 otherwise:

# First option with type coercion:
df$z <- df$y * df$x
# Second option with vectorized if ... else ...:
df$z <- dplyr::if_else(df$x, df$y, 0)

As another example, I want to add 500 to y if x is FALSE and leave y unchanged otherwise:

# First option with type coercion:
df$z <- df$y + 500 * !df$x
# Second option with vectorized if ... else ...:
df$z <- dplyr::if_else(!df$x, df$y + 500, df$y)

I usually use the conditional option, but sometimes it seems like a nice shortcut to use coercion.

Santiago
  • Are there any objective criteria you are hoping to improve? Otherwise this sounds pretty subjective. – Jon Spring Nov 16 '22 at 15:59
  • [here's some discussion about your problem](https://stackoverflow.com/questions/5554725/which-value-is-better-to-use-boolean-true-or-integer-1) – Juan C Nov 16 '22 at 17:29
  • I prefer the more explicit (conditional) option, but voting to close as opinion-based - I don't think there's another answer (there may be performance differences between these two approaches, but it would be unusual for performance differences in this step to be the bottleneck in your overall performance) – Ben Bolker Nov 16 '22 at 18:35

1 Answer

Whether this is bad coding style depends on whether the project you are contributing to already has a style guide. If it does not, then using coercion is fine as long as you are comfortable with it. I would recommend adding a comment to the source code noting that you are relying on R treating TRUE and FALSE as 1 and 0, respectively, for the benefit of the next coder (who is usually you in six weeks!).
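As a minimal sketch of what such a comment might look like, using the data frame from the question (the wording of the comment and the extra parentheses around `!df$x` are only suggestions, not an established convention):

```r
df <- data.frame(
  x = c(TRUE, FALSE, FALSE, TRUE),
  y = c(100, 100, 140, 180)
)

# NOTE: relies on R coercing logicals in arithmetic (TRUE -> 1, FALSE -> 0),
# so the 500 is added only on rows where x is FALSE.
# The parentheses around !df$x are optional here but make the intent explicit.
df$z <- df$y + 500 * (!df$x)
```

With this data, `df$z` should come out as 100, 600, 640, 180, the same as the conditional version in the question.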

As for timing, the tests below show that, in this simple case, the implicit conversion is the fastest, followed by base R's vectorized ifelse, and then by dplyr::if_else. If your data is actually huge, you may also want to consider using data.table.

DF <- data.frame(
  x = c(TRUE, FALSE, FALSE, TRUE),
  y = c(100, 100, 140, 180)
)

library(microbenchmark)
MB <- microbenchmark(DF$z <- DF$y + 500 * !DF$x,
                     DF$z <- dplyr::if_else(!DF$x, DF$y + 500, DF$y),
                     DF$z <- ifelse(!DF$x, DF$y + 500, DF$y),
                     DF$z <- DF$y + ifelse(!DF$x, 500, 0),
                     control = list(order = 'block'),
                     check = 'identical',
                     setup = "DF$z <- NULL",
                     times = 1000L)

print(MB, order = "median")
Unit: microseconds
                                            expr  min   lq     mean median     uq    max neval cld
                      DF$z <- DF$y + 500 * !DF$x  6.6  6.9   7.0910    6.9   7.10   31.6  1000  a 
            DF$z <- DF$y + ifelse(!DF$x, 500, 0)  9.5  9.8  10.4312   10.0  10.30   31.4  1000  a 
         DF$z <- ifelse(!DF$x, DF$y + 500, DF$y) 10.7 11.1  11.5493   11.3  11.50   67.8  1000  a 
 DF$z <- dplyr::if_else(!DF$x, DF$y + 500, DF$y) 96.1 98.5 115.6216  103.0 128.85 3411.8  1000   b
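
For the data.table suggestion, a rough sketch of both styles might look like the following. It assumes the data.table package is installed and uses its `fifelse()` function and `:=` operator; the column name `z2` is made up for illustration. It was not part of the benchmark above, so no timing is claimed for it.

```r
library(data.table)

DT <- data.table(
  x = c(TRUE, FALSE, FALSE, TRUE),
  y = c(100, 100, 140, 180)
)

# Conditional style: fifelse() is data.table's ifelse-like function;
# := adds the column by reference instead of copying the table.
DT[, z := fifelse(!x, y + 500, y)]

# Coercion style: relies on TRUE -> 1 and FALSE -> 0 in arithmetic.
DT[, z2 := y + 500 * (!x)]
```

Both columns should hold the same values, so the choice between them is again mostly about readability.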
Avraham