4

I need to subset a data.table based on a condition across multiple columns and then perform an operation on the result.

A minimal example with a simple data.table:

x <- data.table(id=c(1, 2, 3, 4), colour1 = c('red', 'green', 'green', 'blue'), 
                colour2 = c('yellow', 'red', 'blue', 'black'), 
                colour3 = c('blue', 'black', 'red', 'yellow'),
                score = c(0.7, 0.9, 0.2, 0.35))

I then want to find the maximum score for any row that contains the colour 'yellow':

max_score <- max(x[colour1 == 'yellow' | colour2 == 'yellow' | colour3 == 'yellow']$score)

In this example, the above code works fine but is there a better way of checking this condition across multiple columns? In practice, the table will be much larger and the number of columns can change so I'd like to do this dynamically.

Danny Friar
  • 383
  • 4
  • 17
  • Related almost duplicate post: http://stackoverflow.com/questions/18589595/filter-rows-based-on-multiple-column-conditions-r – zx8754 Jun 22 '16 at 10:52

5 Answers5

7

It may feel difficult because your data is in a wide (i.e. "untidy") format. Just convert to long format using melt and it's really simple.

melt(x, measure.vars = grep("^colo", names(x)))[,max(score[value == "yellow"])]
# [1] 0.7

or

melt(x, measure.vars = grep("^colo", names(x)))[value == "yellow", max(score)]

To see what's going on, run this first

melt(x, measure.vars = grep("^colo", names(x)))

and then the whole command.


Of course you can do it in two steps too, in case you want to keep the long format for further calculations:

x_tidy <- melt(x, measure.vars = grep("^colo", names(x)))
x_tidy[value == "yellow", max(score)]
#[1] 0.7
talat
  • 68,970
  • 21
  • 126
  • 157
2

Using rowSums:

max(
  x[ rowSums(x[, grepl("colour", colnames(x)), with = FALSE] == "yellow") > 0,
     "score", with = FALSE]
  )
zx8754
  • 52,746
  • 12
  • 114
  • 209
1

Using R computing on the language feature.

i.cols.equal <- function(cols, value) {
    l <- lapply(cols, function(col) call("==", as.name(col), value))
    Reduce(function(a, b) bquote(.(a) | .(b)), l)
}
ii <- i.cols.equal(cols=c("colour1", "colour2", "colour3"), value="yellow")
print(ii)
#colour1 == "yellow" | colour2 == "yellow" | colour3 == "yellow"
x[eval(ii), max(score)]
#[1] 0.7
jangorecki
  • 16,384
  • 4
  • 79
  • 160
1

with native R

y=data.frame(x)
max(y$score[apply(y[,grep("^colo", colnames(y))]=="yellow",1,any)])

[1] 0.7
Robert
  • 5,038
  • 1
  • 25
  • 43
1

Another idea, similar to jangorecki's (probably more straightforward), is to simply loop through the columns accumulating a "logical" vector of rows to check for max in x[["score"]] to save some memory usage:

i = logical(nrow(x))
for(j in which(startsWith(names(x), "colour"))) i = i | x[[j]] == "yellow" 
max(x[["score"]][i])
#[1] 0.7
alexis_laz
  • 12,884
  • 4
  • 27
  • 37