1

I want to conditionally generate output in a column based on values in two other columns using data.table. This is something I want to achieve using a function containing 'if' statements specifically, as I'm just trying to learn how to use data.table. The if statement contains shift(), and I think that this may be the cause of my problems.

What I have tried so far is the following:

library(data.table)

#Data

DT <- data.table(V1 = sample(LETTERS[1:3], 20, replace = TRUE),
                 V2 = sample(1:5, 20, replace = TRUE))

#function

fun1 <- function(x, y){
  if(x == "C" & shift(y,  type = "lead") > y){
    return("Greater")
  } else if(x == "C" & shift(y,  type = "lead") < y){
    return("Lesser")
  } else{
    return(NA)
  }
}

#function implementation

DT.v1 <- DT[, V3 := mapply(fun1, x = V1, y = V2)]

If I run the above, I get the error:

Error in if (x == "C" & shift(y, type = "lead") > y) { : 
  missing value where TRUE/FALSE needed

I had an inkling that this error may be caused by comparison with NA in the last iteration as shift(y, type = "lead") would equal NA. Addition of the condition !is.na(shift(y, type = "lead")) in the 'if' statement did stop the error from appearing but leads to generation of only NULL values.

I have been able to generate my desired output (see script below) but would also like to learn how i can achieve this using 'if' statements in a function.

DT.v2 <- DT[V1 == "C" & shift(V2, type = "lead") > V2, V3 := "Greater"][
            V1 == "C" & shift(V2, type = "lead") < V2, V3 := "Lesser"]

#or an alternative way to generate the desired output:

DT.v3 <- DT[, V3 := ifelse(V1 == "C" & shift(V2, type = "lead") > V2, "Greater",
                           ifelse(V1 == "C" & shift(V2, type = "lead") < V2, "Lesser", NA))]

Can someone help me understand how to implement the function in a correct way? Thanks for taking the time to help!

baobab
  • 13
  • 2

2 Answers2

1

The use of mapply subsumes the ability of shift to see data around the row you're working on.

Incidentally, I'm replacing the single-& in the if statements with &&, you should never use & there unless it is aggregated, e.g., within sum, any, all, etc. (See Difference between Boolean operators && and & and between || and | in R for a discussion on the differences.)

One way to approach this is to pass the shifted data as an argument to your function:

set.seed(42)
DT <- data.table(V1 = sample(LETTERS[1:3], 20, replace = TRUE),
                 V2 = sample(1:5, 20, replace = TRUE))

fun2 <- function(x, y, shifty) {
  if (x == "C" && isTRUE(shifty > y)) {
    return("Greater")
  } else if (x == "C" && isTRUE(shifty < y)) {
    return("Lesser")
  } else{
    return(NA)
  }
}

DT[, V3 := mapply(fun2, x = V1, y = V2, shifty = shift(V2, type="lead"))]
#         V1    V2      V3
#     <char> <int>  <char>
#  1:      A     4    <NA>
#  2:      A     5    <NA>
#  3:      A     5    <NA>
#  4:      A     5    <NA>
#  5:      B     4    <NA>
#  6:      B     2    <NA>
#  7:      B     4    <NA>
#  8:      A     3    <NA>
#  9:      C     2  Lesser
# 10:      C     1 Greater
# 11:      A     2    <NA>
# 12:      A     3    <NA>
# 13:      B     2    <NA>
# 14:      B     4    <NA>
# 15:      B     4    <NA>
# 16:      C     2 Greater
# 17:      C     5  Lesser
# 18:      A     4    <NA>
# 19:      A     5    <NA>
# 20:      C     4    <NA>
#         V1    V2      V3

The use of isTRUE in this fun2 is to account for the condition where shifty will be NA; another way to avoid this is to use shifty=shift(V3, type="lead", fill=0) where 0 is some meaningful number to the context of the data and analysis.

In case you don't need to use a function, another option is to use fcase:

DT[, V5 := fcase(
    V1 == "C" & shift(V2, type="lead") > V2, "Greater",
    V1 == "C" & shift(V2, type="lead") < V2, "Lesser" )]

One interesting thing about the use of fcase as opposed to the base if statements: with if, if any of the operands are NA and you don't explicitly account for this, then the condition itself will be NA causing the if statement to fail (see Error in if/while (condition) {: missing Value where TRUE/FALSE needed). This is not true with fcase:

if (NA == 1) 2 else 3
# Error in if (NA == 1) 2 else 3 : missing value where TRUE/FALSE needed
fcase(NA == 1, 2, TRUE, 3)
# [1] 3

Relatedly, while ifelse does not fail, it also does not necessarily work as we want, whereas fifelse gives us the option to deal explicitly with NA conditions:

ifelse(NA == 1, 2, 3)
# [1] NA
fifelse(NA == 1, 2, 3)
# [1] NA
fifelse(NA == 1, 2, 3, 4)
# [1] 4

This is suggested by the formals (and documented in the docs):

formals(ifelse)
# $test
# $yes
# $no
formals(fifelse)
# $test
# $yes
# $no
# $na
# [1] NA
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks for the insightful answer! I didn't know about `fcase` and `fifelse`. These seem to be a great way to deal with `NA` operands. – baobab Jun 04 '23 at 14:50
  • They are the `data.table` equivalents of (if you use it) `dplyr`'s `case_when` and `if_else`, all of them aiming to improve the performance and class-safety of `base::ifelse` (and no "case" variant in base, not directly at least). Glad it helped. – r2evans Jun 04 '23 at 17:48
0

The main issue,on top of the NA issue that you rightly pointed out is that the if statement requires condition of length one

a <- 1:10
> if(a>5)
+ { print("hello")}
Error in if (a > 5) { : the condition has length > 1

And the fun1 is passing a vector - update fun1 is not passing a vector see comment below.

In my opinion your second option is the right way to do the job.

DT.v2 <- DT[V1 == "C" & shift(V2, type = "lead") > V2, V3 := "Greater"][
            V1 == "C" & shift(V2, type = "lead") < V2, V3 := "Lesser"]

This is less error prone. If you really need to use if else statement I would suggest checking ?fcase.

DJJ
  • 2,481
  • 2
  • 28
  • 53
  • Your first assertion is not true. By using `mapply`, the OP is passing length-1 arguments to `fun1`. You can verify this with `debug(fun1)` then looking at `x` and `y` when you try the OP's code. – r2evans Jun 02 '23 at 19:22
  • 1
    @r2evans, indeed thanks for the feedback – DJJ Jun 02 '23 at 19:24