I am trying to run multiple conditional statements in a loop. My first conditional is an if, else if with 3 conditions (4 technically if nothing matches). My second really only needs one condition, and I want to keep the original row value if it doesn't meet that condition. The problem is my output doesn't match the row numbers, and I'm not sure how to output only to a specific row in a loop.

I want to loop over each column, and within each column use sapply to check each value: outside of range1 is marked 4, inside of range1 is marked 1, NA is marked 9, and anything else is marked -999. A narrower range2 is then applied: if a value falls outside of range2, mark it with a 3; otherwise leave the existing mark unchanged.

My partially working code and a reproducible example are below. My input and first loop are:

df <- structure(list(A = c(-2, 3, 5, 10, NA), A.c = c(NA, NA, NA, NA, NA), B = c(2.2, -55, 3, NA, 99), B.c = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -5L))

> df
   A A.c     B B.c
1 -2  NA   2.2  NA
2  3  NA -55.0  NA
3  5  NA   3.0  NA
4 10  NA    NA  NA
5 NA  NA  99.0  NA

min1 <- 0
max1 <- 8

test1.func <- function(x) {
  val <- if (!is.na(x) & is.numeric(x) & (x < min1 | x > max1)) {
    num = 4
  } else if (!is.na(x) & is.numeric(x) & x >= min1 & x <= max1) {
    num = 1
  } else if (is.na(x)) { # TODO it would be better to make this just what is already present in the row
    num = 9
  } else {
    num = -999
  }
  val
}

Test1 <- function(x) {
  i <- NA
  for(i in seq(from = 1, to = ncol(x), by = 2)){
    x[, i + 1] <- sapply(x[[i]], test1.func)
  }
  x
}

df_result <- Test1(df)

> df_result
   A A.c     B B.c
1 -2   4   2.2   1
2  3   1 -55.0   4
3  5   1   3.0   1
4 10   4    NA   9
5 NA   9  99.0   4

The next loop and conditional (any existing values of 4 or 9 would remain):

min2 <- 3
max2 <- 5

test2.func <- function(x) {
  val <- if (!is.na(x) & is.numeric(x) & (x < min2 | x > max2)) {
    num = 3
  }
  val
}

Test2 <- function(x) {
  i <- NA
  for(i in seq(from = 1, to = ncol(x), by = 2)){
    x[, i + 1] <- sapply(x[[i]], test2.func)
  }
  x
}

df_result2 <- Test2(df_result)
# Only 2.2 matches, if working correctly would output
> df_result2
   A A.c     B B.c
1 -2   4   2.2   3
2  3   1 -55.0   4
3  5   1   3.0   1
4 10   4    NA   9
5 NA   9  99.0   4

The current code gives a warning, since there is only one match:

Warning messages:
1: In `[<-.data.frame`(`*tmp*`, , i + 1, value = list(3, NULL, NULL,  :
  provided 5 variables to replace 1 variables
Anonymous coward
  • Could you summarise the logic in a sentence or two? – NelsonGon Apr 13 '20 at 15:24
  • Where does that error occur? Btw, bad practice to use `&` (single) in `if` statements, use `&&` (and `||`) instead (see https://stackoverflow.com/q/16027840/3358272 and `?Logic`). – r2evans Apr 13 '20 at 15:27
  • @NelsonGon The logic is, (loop 1) if it is outside of min1 and max1, return a 4 (fail), if it is an NA, return a 9, if it passes return a 1, otherwise return a -999. In loop 2, skip any rows where loop 1 returned a 4 or 9, and if the value is outside of min2 and max2, it returns a 3. – Anonymous coward Apr 13 '20 at 15:33
  • @r2evans The error occurs with `Test2(df_result)`. Thank you for the advice. Loops are not my strength. If I' – Anonymous coward Apr 13 '20 at 15:37
  • Also I'd strongly recommend some parentheses in your complex conditions. `!is.na(x) & is.numeric(x) & x < min1 | !is.na(x) & is.numeric(x) & x > max1` would be clearer as `(!is.na(x) & is.numeric(x) & x < min1) | (!is.na(x) & is.numeric(x) & x > max1)` (if that's what you mean), which could then be simplified to be `!is.na(x) & is.numeric(x) & (x < min1 | x > max1)`... and perhaps further simplified if you first check for numeric, then missingness, then the numeric conditions rather than vice versa. – Gregor Thomas Apr 13 '20 at 16:28
  • But, overall when Nelson asks for a summary, it would be helpful if you'd start with something like *"I want to loop over each column. Within each column I use `sapply` to check each value to see..."*. Hearing you state those things would help us confirm that your code attempt matches your intentions. – Gregor Thomas Apr 13 '20 at 16:32
  • @GregorThomas I will update the summary of logic, and simplify the conditions for clarity. – Anonymous coward Apr 13 '20 at 18:49

1 Answer

Some thoughts.

  1. for loops are not necessary, it is better to capitalize on R's vectorized operations;
  2. it appears that your values of 4 and 3 are really something like "outside band 1" and "outside band 2", in which case this can be resolved in one function.
  3. Testing for == "NA" is a bit off ... if one of the values in a column is a string "NA" (and not R's NA value), then all values in that column are strings and you have other problems. Because of this, I don't explicitly check for is.numeric, though it is not hard to work back in.
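As for the warning itself: an `if` with no `else` branch returns `NULL` when its condition is false, so `sapply` gets results of different lengths and hands back a list instead of a vector. A minimal sketch of that, plus a vectorized second pass that falls back to the existing flag (the names `flag_outside`, `vals`, `old`, and `new` are made up for illustration, and the 3 to 5 band is taken from the question):

```r
# Hypothetical single-value check with no else branch, as in test2.func
flag_outside <- function(x, lo = 3, hi = 5) {
  if (!is.na(x) && (x < lo || x > hi)) 3
}

vals <- c(2.2, 3, 5)
res <- sapply(vals, flag_outside)
is.list(res)  # TRUE: non-matching elements are NULL, so sapply cannot simplify

# Vectorized alternative: keep the existing flag when the condition is not met
old <- c(1, 1, 1)
new <- ifelse(old == 1 & !is.na(vals) & (vals < 3 | vals > 5), 3, old)
new  # 3 1 1
```

The `old == 1` guard is one way to leave existing 4s and 9s untouched, assuming the first pass has already filled every flag.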

Try this:

func <- function(x, range1, range2) {
  ifelse(is.na(x), 9L,
         ifelse(x < range1[1] | x > range1[2], 4L,
                ifelse(x < range2[1] | x > range2[2], 3L,
                       1L)))
}

df[,c("A.c", "B.c")] <- lapply(df[,c("A", "B")], func, c(0, 8), c(3, 5))
df
#    A A.c     B B.c
# 1 -2   4   2.2   3
# 2  3   1 -55.0   4
# 3  5   1   3.0   1
# 4 10   4    NA   9
# 5 NA   9  99.0   4

One problem I have with this is that it uses three nested ifelse calls. While this works fine, it can be difficult to trace and troubleshoot (and ifelse has problems of its own). If you have other conditions to incorporate, it might be nice to use dplyr::case_when.

func2 <- function(x, range1, range2) {
  dplyr::case_when(
    is.na(x)                      ~ 9L,
    x < range1[1] | x > range1[2] ~ 4L,
    x < range2[1] | x > range2[2] ~ 3L,
    TRUE                          ~ 1L
  )
}

I find this second method much easier to read, though it does have the added dependency of dplyr (which, while it definitely has advantages and strengths, includes an army of other dependencies). If you are already using any of the tidyverse packages in your workflow, though, this is likely the better solution.
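If the extra dependency is a concern, the same flags can probably be had in base R with logical indexing, assigning from lowest to highest priority so that later assignments win. This is only a sketch under the same two-range assumption; `func_base` is a made-up name:

```r
# Base-R sketch: start everything at 1, then overwrite in increasing priority
func_base <- function(x, range1, range2) {
  out <- rep(1L, length(x))
  ok <- !is.na(x)
  out[ok & (x < range2[1] | x > range2[2])] <- 3L  # outside the narrow band
  out[ok & (x < range1[1] | x > range1[2])] <- 4L  # outside the wide band wins
  out[!ok] <- 9L                                   # missing values
  out
}

func_base(c(2.2, -55, 3, NA, 99), c(0, 8), c(3, 5))
# 3 4 1 9 4
```

Adding a further gate is then one more assignment line, placed according to its priority.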

r2evans
  • Thank you for elaborating. `dplyr::case_when` might be best to use then, because there will be more after my 2 example tests. I was trying to pin down the logic, then look to see if `data.table` had any solutions, but `case_when` might handle it fine and be much easier to code all of those gates. – Anonymous coward Apr 13 '20 at 16:04
  • How about [`data.table::fcase`](https://rdatatable.gitlab.io/data.table/reference/fcase.html)? – r2evans Apr 13 '20 at 16:31
  • (That function is not in CRAN yet, it's in 1.12.9.) – r2evans Apr 13 '20 at 16:38
  • I will take a look at `data.table::fcase` next. – Anonymous coward Apr 13 '20 at 18:57
  • Thanks for the `data.table::fcase` suggestion. It's about 3 times faster than dplyr::case_when, and of course about 20x faster than a standard for-loop. – Anonymous coward Apr 16 '20 at 01:14
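
For completeness, a sketch of what the `data.table::fcase` version mentioned in the comments might look like, assuming an installed data.table release that includes `fcase`. The name `func_fcase` is made up here; conditions and values alternate, the first matching condition wins, and `default` covers everything else:

```r
# Sketch only: same gates as func/func2 above, written with data.table::fcase
func_fcase <- function(x, range1, range2) {
  data.table::fcase(
    is.na(x),                      9L,
    x < range1[1] | x > range1[2], 4L,
    x < range2[1] | x > range2[2], 3L,
    default = 1L
  )
}

func_fcase(c(-2, 3, 5, 10, NA), c(0, 8), c(3, 5))
# 4 1 1 4 9
```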