Choose column name of the first column which fits certain logical test

Question

I have the following input:

id <- c("a", "b", "c", "d")
target <- seq(from = 100, to = 400, length.out = 4)
a <- c(300, 304, 100, 405)
b <- c(300, 104, 100, 405)
c <- c(85, 304, 500, 405)
df <- data.frame(id, target, a, b, c)  # data.frame() keeps the numeric columns numeric; cbind() would turn them into text

I would like to add a new column "column" which indicates, per row, which of the columns "a", "b", "c" is the first one with a value smaller than the target. The requested output looks like this:

Required Output:

df$column <- c("c", "b", "a", "NA")
df

I thought about a concatenated if check per row, applied to all rows with the apply function. However, there are quite a few columns like a, b, c (around 20, so a loop would be required) and about 4,000 rows. Does anybody have an idea how to solve this?

Jaap · Answer 1 · 2016-10-30T11:15:30.187

5

You can do this as follows:

1) Create a logical matrix indicating whether or not a value in the 'a', 'b' or 'c' column is smaller than the target column:

m <- df[,3:5] < df[,2]

2) Use max.col to create an integer vector holding, for each row, the index of the first of these three columns whose value is smaller than the target. max.col still returns 1 for rows where no value is smaller, so make sure NA is returned for those rows with [c(TRUE,NA)[1 + (rowSums(m) == 0)]]:

mc <- max.col(m, ties.method = 'first')[c(TRUE,NA)[1 + (rowSums(m) == 0)]]

3) Assign the names to a new column:

df$column <- names(df[,3:5])[mc]

which gives:

> df
  id target   a   b   c column
1  a    100 300 300  85      c
2  b    200 304 104 304      b
3  c    300 100 100 500      a
4  d    400 405 405 405   <NA>

I separated the steps to make it clearer what the code does, but you can of course combine them as follows:

m <- df[,3:5] < df[,2]
df$column <- names(df[,3:5])[max.col(m, ties.method = 'first')[c(TRUE,NA)[1 + (rowSums(m) == 0)]]]
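The NA-masking step above is compact but hard to read. A minimal sketch of the same logic with named intermediate values follows; the variable names `first_hit`, `no_hit` and `result` are made up for illustration, and the data frame is rebuilt with `data.frame()` so the block runs on its own.

```r
# Rebuild the example data (numeric columns stay numeric).
df <- data.frame(
  id     = c("a", "b", "c", "d"),
  target = c(100, 200, 300, 400),
  a      = c(300, 304, 100, 405),
  b      = c(300, 104, 100, 405),
  c      = c(85, 304, 500, 405)
)

m <- df[, 3:5] < df[, 2]                        # logical matrix, one row per data row

first_hit <- max.col(m, ties.method = "first")  # position of the first TRUE per row
no_hit    <- rowSums(m) == 0                    # rows without any TRUE

# For a row with no TRUE, all entries tie and max.col() returns 1,
# so those rows have to be masked explicitly.
first_hit[no_hit] <- NA

result <- names(df)[3:5][first_hit]
print(result)
# [1] "c" "b" "a" NA
```

Assigning `NA` through a logical index is one readable substitute for the `c(TRUE,NA)[...]` lookup; both should give the same vector.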

edited Oct 30 '16 at 11:15

answered Oct 29 '16 at 16:49

Jaap


I never thought this could be vectorized! You've got my +1. Is this a good place / time to suggest some feedback? – Ivan Oct 29 '16 at 22:07
Thanks for the answer. I changed my system to Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) RStudio Safari/538.1 Qt/5.4.0, and since then the script gives me the following error message: `Error in Ops.data.frame(df[, 3:7], df[, 2]) : ‘<’ only defined for equally-sized data frames`. If I change the code to `df[, 3:7], df[[, 2]])`, I get `Error in [[.data.frame(df, , 2) : argument "..1" is missing, with no default`. Any idea how I could solve the issue? – Nils Jan 11 '17 at 19:01
@Nils can't tell for sure from solely the error-message; please include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) in the question – Jaap Jan 11 '17 at 19:45
Hi Jaap, thanks, solved it. The problem was that the data frame was also defined as a data.table, so the [] were interpreted differently... – Nils Jan 17 '17 at 21:26
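The error from this comment thread can be reproduced in base R. This is only a sketch, under the assumption that the cause was `df[, 2]` staying a one-column table (as `[` on a data.table does) instead of dropping to a vector. Here `drop = FALSE` stands in for that behaviour, and `df[[2]]` (not `df[[, 2]]`) is one way to extract a plain vector.

```r
df <- data.frame(
  id     = c("a", "b", "c", "d"),
  target = c(100, 200, 300, 400),
  a      = c(300, 304, 100, 405),
  b      = c(300, 104, 100, 405),
  c      = c(85, 304, 500, 405)
)

# Stand-in for the assumed data.table behaviour: column 2 is kept as a
# one-column data frame, so a 3-column table is compared with a 1-column one.
err <- tryCatch(
  df[, 3:5] < df[, 2, drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(err)  # message about equally-sized data frames

# `[[` with a single index returns the column as a plain vector,
# so the comparison yields a logical matrix again.
m <- df[, 3:5] < df[[2]]
print(m)
```

Converting the object with `as.data.frame()` before running the answer's code would be another option.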

score 3 · Accepted Answer · answered Oct 29 '16 at 20:09


Here's another vectorized solution using which. It basically takes all the occurrences where the target is larger and then keeps the first instance per row using the duplicated function.

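# All (row, col) positions where a value is below the target, listed column by column
indx <- which(df[, 3:5] < df[, 2], arr.ind = TRUE)
# Keep the first hit per row; drop = FALSE keeps the matrix shape if only one row remains
indx2 <- indx[!duplicated(indx[, "row"]), , drop = FALSE]
df[indx2[, "row"], "column"] <- names(df)[3:5][indx2[, "col"]]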
df
#   id target   a   b   c column
# 1  a    100 300 300  85      c
# 2  b    200 304 104 304      b
# 3  c    300 100 100 500      a
# 4  d    400 405 405 405   <NA>
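Why `duplicated` picks the first column is easier to see from the intermediate matrix. The sketch below prints it for the example data; the names `hits`, `first` and `column` are illustrative, and the result is built as a separate vector rather than written into `df`.

```r
df <- data.frame(
  id     = c("a", "b", "c", "d"),
  target = c(100, 200, 300, 400),
  a      = c(300, 304, 100, 405),
  b      = c(300, 104, 100, 405),
  c      = c(85, 304, 500, 405)
)

# One line per TRUE cell, with columns named "row" and "col".
hits <- which(df[, 3:5] < df[[2]], arr.ind = TRUE)
print(hits)

# which() walks the matrix column by column, so for any given row the
# lowest column index is listed first; !duplicated() keeps exactly that one.
first <- hits[!duplicated(hits[, "row"]), , drop = FALSE]

# Rows that never appear in `hits` keep their NA.
column <- rep(NA_character_, nrow(df))
column[first[, "row"]] <- names(df)[3:5][first[, "col"]]
print(column)
# [1] "c" "b" "a" NA
```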

answered Oct 29 '16 at 20:09

David Arenburg



score 0 · Answer 3 · edited Oct 29 '16 at 20:10


You can apply a function along the rows and use the result to populate your column, for example:

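searchFunction <- function(row) {
  result <- "NA"  # stays the string "NA" when no column qualifies
  for (name in names(row)) {
    # skip the columns that are not candidates
    if (name == "target" || name == "id") {
      next
    }
    # keep only the first column whose value is below the target
    if (result == "NA" && as.numeric(row[name]) < as.numeric(row["target"])) {
      result <- name
    }
  }
  return(result)
}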

apply(df, 1, searchFunction)
# [1] "c"  "b"  "a"  "NA"
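A more compact row-wise variant is sketched below as a hedged alternative to the explicit loop: all candidate columns are compared at once, and `apply` only picks the first match per row. Unlike the function above it returns a real `NA` rather than the string "NA". The column positions `3:5` and the name `first_name` are assumptions tied to the example data.

```r
df <- data.frame(
  id     = c("a", "b", "c", "d"),
  target = c(100, 200, 300, 400),
  a      = c(300, 304, 100, 405),
  b      = c(300, 104, 100, 405),
  c      = c(85, 304, 500, 405)
)

# Logical matrix with column names a, b, c.
m <- df[, 3:5] < df$target

# For each row, take the name of the first TRUE entry;
# which(hit)[1] is NA when the row has no TRUE, which yields NA here too.
first_name <- apply(m, 1, function(hit) names(hit)[which(hit)[1]])
print(first_name)
# [1] "c" "b" "a" NA
```

This still loops over rows inside `apply`, so for large data the fully vectorized answers are likely the better choice.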

edited Oct 29 '16 at 20:10

David Arenburg


answered Oct 29 '16 at 16:59

Ivan


No need for a for-loop imo. Most functions in R are vectorised. – h3rm4n Oct 29 '16 at 18:27
@h3rm4n thanks for the feedback, I would be very curious to see vectorized version of this – Ivan Oct 29 '16 at 18:30
For loops are not so bad in general, but you are basically performing a double for loop here (`apply` is also a for loop), with the inner one running per row, which could get nasty for a big data set. – David Arenburg Oct 29 '16 at 20:13
@DavidArenburg thank you for sharing that detail on apply – Ivan Oct 29 '16 at 22:11
