Why does nested ifelse create incorrect results in dplyr 0.5.0 mutate?

Question

Consider the following data frame:

(tmp_df <-
structure(list(class = c(0L, 0L, 1L, 1L, 2L, 2L), logi = c(TRUE, 
FALSE, TRUE, FALSE, TRUE, FALSE), val = c(1, 1, 1, 1, 1, 1), 
    taken = c(1.00684931506849, 0.993197278911565, 1.025, 0.975609756097561, 
    1.00826446280992, 0.991803278688525)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("class", 
"logi", "val", "taken")))

which creates:

Source: local data frame [6 x 4]

  class  logi   val     taken
  <int> <lgl> <dbl>     <dbl>
1     0  TRUE     1 1.0068493
2     0 FALSE     1 0.9931973
3     1  TRUE     1 1.0250000
4     1 FALSE     1 0.9756098
5     2  TRUE     1 1.0082645
6     2 FALSE     1 0.9918033

I wish to group by class and, if a group contains exactly two members, subtract 1 from val where logi == FALSE and otherwise subtract the group's minimum value of taken from val. If a group does not contain two members, subtract zero from val.

Using the dplyr package, the above can be written as:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(n() != 2, 0, 
                              ifelse(logi, min(taken), 1)),
           not_taken = val - taken_2)

However, this produces an incorrect result, whereby the second ifelse always resolves to the first condition:

Source: local data frame [6 x 6]
Groups: class [3]

  class  logi   val     taken   taken_2   not_taken
  <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
1     0  TRUE     1 1.0068493 0.9931973 0.006802721
2     0 FALSE     1 0.9931973 0.9931973 0.006802721
3     1  TRUE     1 1.0250000 0.9756098 0.024390244
4     1 FALSE     1 0.9756098 0.9756098 0.024390244
5     2  TRUE     1 1.0082645 0.9918033 0.008196721
6     2 FALSE     1 0.9918033 0.9918033 0.008196721

The correct result can be produced if we do not have the first ifelse statement.

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(logi, min(taken), 1),
           not_taken = val - taken_2)

producing:

Source: local data frame [6 x 6]
Groups: class [3]

  class  logi   val     taken   taken_2   not_taken
  <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
1     0  TRUE     1 1.0068493 0.9931973 0.006802721
2     0 FALSE     1 0.9931973 1.0000000 0.000000000 # correct!
3     1  TRUE     1 1.0250000 0.9756098 0.024390244
4     1 FALSE     1 0.9756098 1.0000000 0.000000000 # correct!
5     2  TRUE     1 1.0082645 0.9918033 0.008196721
6     2 FALSE     1 0.9918033 1.0000000 0.000000000 # correct!

The problem seems to be isolated to mutate combined with the nested ifelse, because other code fragments that do similar things work as expected:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(n() != 3, 0, 
                            ifelse(logi, min(taken), 1)),
           not_taken = val - taken_2)

tmp_df_2 <-
    tmp_df %>%
    filter(row_number() <= 2)

(tmp_df_2$taken_2 <-
    ifelse(c(0, 0), 0, 
           ifelse(tmp_df_2$logi, min(tmp_df_2$taken), 1)))

## but the following does not work (checks problem is not to do with grouping)
# tmp_df_2 %>%
#     mutate(taken_2 = ifelse(n() != 2, 0, 
#                             ifelse(logi, min(taken), 1)),
#            not_taken = val - taken_2)

Why is this happening, and ~~how can I obtain the expected behaviour~~? A workaround is to split the nested ifelse logic into multiple in-line mutates:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(n() != 2, 0, 1),
           taken_3 = taken_2 * ifelse(logi, min(taken), 1),
           not_taken = val - taken_3)

Someone else has identified a similar problem with nested ifelse, but I don't know whether it has the same root cause: "ifelse using dplyr results in NAs for some records"

Hugh · Accepted Answer · 2016-11-01T06:17:00.033

6

You are a victim of ifelse vector recycling. The key is this line:

mutate(taken_2 = ifelse(n() != 2, 0, 
                          ifelse(logi, min(taken), 1))

Because n() != 2 is length one (for each group), ifelse returns a length-one result: it only considers the first element of logi, and that single value is then recycled across the whole group.
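The effect can be reproduced outside dplyr. The following is a minimal base R sketch, assuming a single two-row group like class 0 in the question; the variable names `inner` and `result` are only illustrative.

```r
# One group from the question: two rows, logi = TRUE then FALSE
logi  <- c(TRUE, FALSE)
taken <- c(1.00684931506849, 0.993197278911565)

# The inner ifelse is vectorised over logi, so it has length 2
inner <- ifelse(logi, min(taken), 1)
length(inner)   # 2

# The outer test is length one (as n() != 2 is inside mutate),
# so ifelse returns a length-one result: only inner[1] survives
result <- ifelse(FALSE, 0, inner)
length(result)  # 1
```

Inside mutate, that single value is what ends up in every row of the group.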

You should use if and if_else:

mutate(taken_2 = if (n() != 2) 0 else if_else(logi, min(taken), 1))
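For completeness, here is a sketch of that line applied to the question's data. It assumes dplyr is installed and rebuilds tmp_df with data.frame() rather than structure(); the name `res` is only illustrative.

```r
library(dplyr)

# Same values as tmp_df in the question
tmp_df <- data.frame(
  class = c(0L, 0L, 1L, 1L, 2L, 2L),
  logi  = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  val   = 1,
  taken = c(1.00684931506849, 0.993197278911565, 1.025,
            0.975609756097561, 1.00826446280992, 0.991803278688525)
)

res <- tmp_df %>%
  group_by(class) %>%
  # `if` handles the length-one group-level test;
  # `if_else` handles the row-level condition
  mutate(taken_2 = if (n() != 2) 0 else if_else(logi, min(taken), 1),
         not_taken = val - taken_2) %>%
  ungroup()

res$taken_2
# 0.9931973 1.0000000 0.9756098 1.0000000 0.9918033 1.0000000
```

The split mirrors the two kinds of condition: a scalar test goes through `if`, and a per-row test goes through `if_else`.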

I would recommend never to use ifelse. Take it from someone who almost caused a multi-million dollar error due to this exact bug.

edited Nov 01 '16 at 06:17

answered Nov 01 '16 at 06:07

Hugh

thanks for your recommendation. `ifelse` was the only way of doing conditional mutates before those functions became available. – Alex Nov 01 '16 at 06:13
Indeed. But now `if_else` is available, you should use it -- and be grateful it is so fussy! – Hugh Nov 01 '16 at 06:13
Also, same comment as I made to @Weihuang, how do you know that the result of the first `ifelse` is length one? – Alex Nov 01 '16 at 06:15
You're right that the documentation could be improved. But from `?n`, the title is '**The** number of observations in the current group'. So there can only be one observation. – Hugh Nov 01 '16 at 06:16
The near multi-million dollar error sounds like a good story. – Joe Nov 01 '16 at 07:52
It's actually pretty dull: calculated payment due on table of clients. Used an ifelse condition -- only the condition for the first client in the table was considered. – Hugh Nov 01 '16 at 08:06

score 3 · Answer 2 · answered Nov 01 '16 at 06:07

From ?ifelse,

‘ifelse’ returns a value with the same shape as ‘test’

and since n() != 2 returns a vector of length one (and is FALSE for every group here), the outer ifelse always returns a vector of length one: the first element of the inner ifelse. That single value is then recycled to fit the shape of the group. One solution is to feed a vector of the length of the group into the first ifelse:

tmp_df %>%
    group_by(class) %>%
    mutate(taken_2 = ifelse(rep(n() != 2, n()), 0, 
                              ifelse(logi, min(taken), 1)),
           not_taken = val - taken_2)
# Source: local data frame [6 x 6]
# Groups: class [3]

#   class  logi   val     taken   taken_2   not_taken
#   <int> <lgl> <dbl>     <dbl>     <dbl>       <dbl>
# 1     0  TRUE     1 1.0068493 0.9931973 0.006802721
# 2     0 FALSE     1 0.9931973 1.0000000 0.000000000
# 3     1  TRUE     1 1.0250000 0.9756098 0.024390244
# 4     1 FALSE     1 0.9756098 1.0000000 0.000000000
# 5     2  TRUE     1 1.0082645 0.9918033 0.008196721
# 6     2 FALSE     1 0.9918033 1.0000000 0.000000000

answered Nov 01 '16 at 06:07

Weihuang Wong

thanks. I thought `n()` would produce a value for each row, but evidently it only does when explicitly assigned as such, e.g. `num_in_row = n()`. I was trying to get away without creating that extra variable and it caught me out. – Alex Nov 01 '16 at 06:10
How do you know that `n() == 2` always returns a vector of length one anyway, since the documentation for `n()` is exceedingly succinct? – Alex Nov 01 '16 at 06:12
You're right, I don't actually know `n() == 2` returns a vector of length one; I inferred it from its behavior. – Weihuang Wong Nov 01 '16 at 06:13
Yes, `n()` returns a vector of length one: `mtcars %>% summarise(x = length(n()))` so `n() == 2` is length one. If you assign `n()` to a variable, it gets recycled, e.g. `mtcars %>% mutate(x = n())` – alistaire Nov 01 '16 at 06:21
thanks @alistaire, that changes the way I think about `n()` now. – Alex Nov 01 '16 at 06:23
