119

In a large dataframe ("myfile") with four columns I have to add a fifth column with values conditionally based on the first four columns.

Prefer answers with dplyr and mutate, mainly because of its speed in large datasets.

My dataframe looks like this:

  V1 V2 V3 V4
1  1  2  3  5
2  2  4  4  1
3  1  4  1  1
4  4  5  1  3
5  5  5  5  4
...

The values of the fifth column (V5) are based on some conditional rules:

if (V1==1 & V2!=4) {
  V5 <- 1
} else if (V2==4 & V3!=1) {
  V5 <- 2
} else {
  V5 <- 0
}

Now I want to use the mutate function to use these rules on all rows (to avoid slow loops). Something like this (and yes, I know it doesn't work this way!):

myfile <- mutate(myfile, if (V1==1 & V2!=4){V5 = 1}
    else if (V2==4 & V3!=1){V5 = 2}
    else {V5 = 0})

This should be the result:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

How to do this in dplyr?

smci
  • 32,567
  • 20
  • 113
  • 146
rdatasculptor
  • 8,112
  • 14
  • 56
  • 81
  • It's useful to state if V1..4 are all integer (not factor, logical, string or float)? and do you care about correctly handling `NA`, (`NaN, +Inf, -Inf`)? – smci Mar 14 '19 at 01:02
  • If speed seems to be an issue for preferring `dplyr`, then I would better use [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html). – Valentin_Ștefan Mar 02 '20 at 12:01

3 Answers3

144

Try this:

myfile %>% mutate(V5 = (V1 == 1 & V2 != 4) + 2 * (V2 == 4 & V3 != 1))

giving:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

or this:

myfile %>% mutate(V5 = ifelse(V1 == 1 & V2 != 4, 1, ifelse(V2 == 4 & V3 != 1, 2, 0)))

giving:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

Note

Suggest you get a better name for your data frame. myfile makes it seem as if it holds a file name.

Above used this input:

myfile <- 
structure(list(V1 = c(1L, 2L, 1L, 4L, 5L), V2 = c(2L, 4L, 4L, 
5L, 5L), V3 = c(3L, 4L, 1L, 1L, 5L), V4 = c(5L, 1L, 1L, 3L, 4L
)), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Update 1 Since originally posted dplyr has changed %.% to %>% so have modified answer accordingly.

Update 2 dplyr now has case_when which provides another solution:

myfile %>% 
       mutate(V5 = case_when(V1 == 1 & V2 != 4 ~ 1, 
                             V2 == 4 & V3 != 1 ~ 2,
                             TRUE ~ 0))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • I tried your second solution. I got this error: Error in mutate_impl(.data, named_dots(...), environment()) : REAL() can only be applied to a 'numeric', not a 'logical' Do you know what's going wrong? – rdatasculptor Mar 11 '14 at 22:10
  • 7
    I discovered a way which allows you to not nest the `ifelse` statements: `myfile %>% mutate(V5 = ifelse(V1 == 1 & V2 != 4, 1, 0), V5 = ifelse(V2 == 4 & V3 != 1, 2, V5))` – Alex Sep 18 '14 at 05:22
46

With dplyr 0.7.2, you can use the very useful case_when function :

x=read.table(
 text="V1 V2 V3 V4
 1  1  2  3  5
 2  2  4  4  1
 3  1  4  1  1
 4  4  5  1  3
 5  5  5  5  4")
x$V5 = case_when(x$V1==1 & x$V2!=4 ~ 1,
                 x$V2==4 & x$V3!=1 ~ 2,
                 TRUE ~ 0)

Expressed with dplyr::mutate, it gives:

x = x %>% mutate(
     V5 = case_when(
         V1==1 & V2!=4 ~ 1,
         V2==4 & V3!=1 ~ 2,
         TRUE ~ 0
     )
)

Please note that NA are not treated specially, as it can be misleading. The function will return NA only when no condition is matched. If you put a line with TRUE ~ ..., like I did in my example, the return value will then never be NA.

Therefore, you have to expressively tell case_when to put NA where it belongs by adding a statement like is.na(x$V1) | is.na(x$V3) ~ NA_integer_. Hint: the dplyr::coalesce() function can be really useful here sometimes!

Moreover, please note that NA alone will usually not work, you have to put special NA values : NA_integer_, NA_character_ or NA_real_.

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92
12

It looks like derivedFactor from the mosaic package was designed for this. In this example, it would look something like:

library(mosaic)
myfile <- mutate(myfile, V5 = derivedFactor(
    "1" = (V1==1 & V2!=4),
    "2" = (V2==4 & V3!=1),
    .method = "first",
    .default = 0
    ))

(If you want the outcome to be numeric instead of a factor, wrap the derivedFactor with an as.numeric.)

Note that the .default option combined with .method = "first" sets the "else" condition -- this approach is described in the help file for derivedFactor.

Jake Fisher
  • 3,220
  • 3
  • 26
  • 39
  • You can also prevent the result from being a factor using the `.asFactor = F` option or by using the (similar) `derivedVariable` function in the same package. – Jake Fisher Jul 27 '16 at 18:08
  • It looks like `recode` from dplyr 0.5 will do this. I haven't investigated it yet though. See https://blog.rstudio.org/2016/06/27/dplyr-0-5-0/ – Jake Fisher Aug 15 '16 at 19:17
  • This was slow for my data with 1e6 rows. – Fato39 Aug 16 '17 at 14:58
  • 3
    @Fato39 Yes, the `mosaic::derivedFactor` family of functions are very slow. If you figure out why, please answer my SO question about it: https://stackoverflow.com/questions/33787691/why-is-mosaicderivedfactor-twice-as-slow-as-a-base-function. I am glad to see from your other comment that `dplyr::case_when` is faster -- I'll have to switch to that. – Jake Fisher Aug 16 '17 at 21:50
  • I'm trying the following command,library(mosaic) VENEZ.FINAL2 <- mutate(VENEZ, SEX = derivedFactor( "M" = (CATEGORY== "BULL" & CATEGORY!="SIRE"), "F" = (CATEGORY== "COW" & CATEGORY!="HEIFER"), .method = "first", .default = "NA")) but it doesn't work, just solve the condition VENEZ.FINAL2 <- mutate(VENEZ, SEX = derivedFactor( "M" = (CATEGORY== "BULL Could you help me? Thanks a lot! – Johanna Ramirez Apr 21 '20 at 10:21