Consider the iris data:

 iris 
        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1            5.1         3.5          1.4         0.2     setosa
    2            4.9         3.0          1.4         0.2     setosa
    3            4.7         3.2          1.3         0.2     setosa
    4            4.6         3.1          1.5         0.2     setosa
    5            5.0         3.6          1.4         0.2     setosa
    6            5.4         3.9          1.7         0.4     setosa
    7            4.6         3.4          1.4         0.3     setosa

I want to create a new column based on a comparison of the values in variable Sepal.Length with a fixed limit / cut-off, e.g. check if the values are larger or smaller than 5:

if Sepal.Length >= 5 assign "UP" else assign "DOWN" to a new column "Regulation".

What's the way to do that?

Henrik
neversaint

3 Answers


Try

iris$Regulation <- ifelse(iris$Sepal.Length >= 5, "UP", "DOWN")
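
As a quick check, the same call can be tried on the seven rows shown in the question. This is only a sketch; `iris_copy` is a name introduced here so the built-in dataset stays unmodified.

```r
# Work on a copy of the first seven rows (the ones printed in the question).
iris_copy <- head(iris, 7)

# ifelse() is vectorized: it tests every element of Sepal.Length at once.
iris_copy$Regulation <- ifelse(iris_copy$Sepal.Length >= 5, "UP", "DOWN")

# Show the tested column next to the new one.
print(iris_copy[, c("Sepal.Length", "Regulation")])
```

Rows 1, 5 and 6 (5.1, 5.0, 5.4) come out as "UP"; the other four come out as "DOWN".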
Oscar de León
  • Can you use a vector in place of ">=5", if you want to check which elements in the df match those in a vector of different length? – FaCoffee Oct 10 '18 at 10:29
  • How can I add to the new column the least value of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width? I tried iris$minimum <- min(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width) but it doesn't work for each column separately. – Mehdi Abbassi Mar 18 '20 at 14:25
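
Both comments can probably be handled in base R. The sketch below assumes that "match those in a vector" means set membership, for which `%in%` is the usual tool, and that "least value" means the row-wise minimum, which `pmin()` computes (`min()` collapses everything into a single number). The names `wanted`, `Match` and `iris_copy` are made up for illustration.

```r
iris_copy <- head(iris, 7)

# Hypothetical lookup vector; its length need not match nrow(iris_copy).
wanted <- c(4.6, 5.0, 5.4)

# %in% returns one TRUE/FALSE per row, regardless of the length of `wanted`.
iris_copy$Match <- ifelse(iris_copy$Sepal.Length %in% wanted, "UP", "DOWN")

# pmin() compares the columns element by element and keeps the smallest
# value in each row.
iris_copy$minimum <- pmin(iris_copy$Sepal.Length, iris_copy$Sepal.Width,
                          iris_copy$Petal.Length, iris_copy$Petal.Width)

print(iris_copy[, c("Sepal.Length", "Match", "minimum")])
```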

To bring this possible canonical question up to date: the dplyr package provides the function mutate, which lets you create a new column in a data.frame in a vectorized fashion:

library(dplyr)
iris_new <- iris %>%
    mutate(Regulation = if_else(Sepal.Length >= 5, 'UP', 'DOWN'))

This makes a new column called Regulation which consists of either 'UP' or 'DOWN' based on applying the condition to the Sepal.Length column.
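
One practical difference from base `ifelse` is that `if_else` has a `missing` argument for rows where the condition evaluates to `NA`. The following is a minimal sketch on made-up data (the `df` data frame and the "UNKNOWN" label are assumptions, not part of iris):

```r
library(dplyr)

# Hypothetical toy data with one missing measurement.
df <- data.frame(Sepal.Length = c(5.1, 4.9, NA, 5.0))

# `missing` supplies the value used where Sepal.Length >= 5 is NA.
df_new <- df %>%
    mutate(Regulation = if_else(Sepal.Length >= 5, "UP", "DOWN",
                                missing = "UNKNOWN"))

print(df_new)
```

Without `missing`, the third row would simply get `NA` in the new column.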

The case_when function (also from dplyr) provides an easy to read way to chain together multiple conditions:

iris %>%
    mutate(Regulation = case_when(Sepal.Length >= 5 ~ 'High',
                                  Sepal.Length >= 4.5 ~ 'Mid',
                                  TRUE ~ 'Low'))

This works just like if_else, except that instead of one condition with a return value for TRUE and another for FALSE, each line has a condition (left side of ~) and a value (right side of ~) that is returned if the condition is TRUE. If it is FALSE, case_when moves on to the next condition.

In this case, rows where Sepal.Length >= 5 will return 'High', rows where Sepal.Length < 5 (since the first condition had to fail) and Sepal.Length >= 4.5 will return 'Mid', and all other rows will return 'Low'. Since TRUE is always TRUE, it is used to provide a default value.
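
When every condition is a numeric cut-off on the same column, as here, base R's `cut()` can produce the same three bins without dplyr. This is an alternative sketch rather than part of the approach above; note that it returns a factor, not a character vector, and `iris_copy` is a name introduced for the example.

```r
iris_copy <- iris

# right = FALSE makes each interval closed on the left: [-Inf, 4.5) is "Low",
# [4.5, 5) is "Mid", [5, Inf) is "High", matching the >= comparisons above.
iris_copy$Regulation <- cut(iris_copy$Sepal.Length,
                            breaks = c(-Inf, 4.5, 5, Inf),
                            labels = c("Low", "Mid", "High"),
                            right = FALSE)

print(table(iris_copy$Regulation))
```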

divibisan

Without ifelse:

iris$Regulation <- c("DOWN", "UP")[ (iris$Sepal.Length >= 5) + 1 ]
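
The one-liner relies on R coercing logicals to numbers in arithmetic: FALSE + 1 is 1 and TRUE + 1 is 2, and those values are then used as positions in the two-element label vector. A small sketch of each step (the `cond` and `labels` names are just for illustration):

```r
# A toy condition vector, including an NA to show how it propagates.
cond <- c(TRUE, FALSE, NA)

# Arithmetic turns TRUE/FALSE into 1/0, so adding 1 gives index 2 or 1.
idx <- cond + 1
print(idx)      # 2 1 NA

# Index 1 picks "DOWN", index 2 picks "UP"; an NA index yields NA.
labels <- c("DOWN", "UP")[idx]
print(labels)   # "UP" "DOWN" NA
```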

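Benchmark on a vector of one million values: by median time, indexing is about 14x faster than ifelse (20.5ms vs 280.2ms) and roughly 2.6x faster than dplyr::if_else (54.1ms):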

bigX <- runif(10^6, 0, 10)

bench::mark(
  x1 = c("DOWN", "UP")[ (bigX >= 5) + 1 ],
  x2 = ifelse(bigX >=5, "UP", "DOWN"),
  x3 = dplyr::if_else(bigX >= 5, "UP", "DOWN")
)
# # A tibble: 3 x 14
# expression     min    mean  median     max `itr/sec` mem_alloc  n_gc n_itr total_time result memory
# <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list>
# x1          19.1ms  23.9ms  20.5ms  31.6ms     41.9     22.9MB     9    22      525ms <chr ~ <Rpro~
# x2         278.9ms 280.2ms 280.2ms 281.5ms      3.57   118.3MB     4     2      560ms <chr ~ <Rpro~
# x3          47.8ms  64.2ms  54.1ms 138.8ms     15.6     68.7MB    11     8      514ms <chr ~ <Rpro~
zx8754