2

I have a data set with many rows and the following columns: an id column, a set of columns showing the result of one round of measurements for several values (val1.x, val2.x, val3.x, ...) and another set of columns showing the result of another round of measurements for the same values (val1.y, val2.y, val3.y, ...). Here is a reduced working example:

d <- data.table(
  id = 1:10,
  val1.x = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  val2.x = c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0),
  val1.y = c(0, 0, 0, 1, 0, NA, NA, 0, 1, 0),
  val2.y = c(1, 0, 0, NA, 0, 1, 0, 0, 1, 0)
)

My goal is to get to a data set which lists the same columns, along with the maximum of the two measurements for each value. Here is the desired output for above example

    id val1.x val2.x val1.y val2.y val1.max val2.max
 1:  1      1      1      0      1        1        1
 2:  2      0      0      0      0        0        0
 3:  3      0      1      0      0        0        1
 4:  4      1      1      1     NA        1        1
 5:  5      0      0      0      0        0        0
 6:  6      1      0     NA      1        1        1
 7:  7      0      0     NA      0        0        0
 8:  8      0      0      0      0        0        0
 9:  9      1      0      1      1        1        1
10: 10      0      0      0      0        0        0

As evident from the example, by max I mean max(..., na.rm = T). I also have a variable cols already prepare with this value:

cols <- c('val1', 'val2')

The Goal

I want to use this variable to loop through the columns dynamically and calculate the max.

What is a good dplyr way of achieving this?

What is a good data.table way of achieving this?

NB: I do not want to use the order of columns (so solutions that reference columns by their order (e.g. 2:3) are not desirable. The input may change and additional columns may be added to the left of the values, so I need to use the names of the columns to do the calculations. The id column will always be unique per row.

What I have tried so far

I can get the right-hand side of the equation to be dynamic using as.symbol like this:

d[, .(val1.max := pmax(eval(as.symbol('val1.x')), eval(as.symbol('val2.x'))))]

But I cannot get the left-hand side to become dynamic.

I also tried to implement a solution based on this SO question but it gives me an error:

left <- "va1.x"
right <- "va1.y"
new <- "val1.max"
expr <- bquote(.(as.name(new)):=pmax(as.name(left), as.name(right), na.rm=T))
d[, eval(expr)]
Merik
  • 2,767
  • 6
  • 25
  • 41

1 Answers1

2

One option in data.table is to melt

library(data.table)
d[melt(d, measure = patterns(cols))[,
    lapply(.SD, max, na.rm = TRUE), .(id), 
    .SDcols = value1:value2], paste0(cols, ".max") :=
         .(value1, value2), on = .(id)][]
#    id val1.x val2.x val1.y val2.y val1.max val2.max
# 1:  1      1      1      0      1        1        1
# 2:  2      0      0      0      0        0        0
# 3:  3      0      1      0      0        0        1
# 4:  4      1      1      1     NA        1        1
# 5:  5      0      0      0      0        0        0
# 6:  6      1      0     NA      1        1        1
# 7:  7      0      0     NA      0        0        0
# 8:  8      0      0      0      0        0        0
# 9:  9      1      0      1      1        1        1
#10: 10      0      0      0      0        0        0

Or another option without melting would be to subset the columns based on the values in 'cols', and use pmax

d[,  paste0(cols, ".max") := lapply(cols, function(pat)
     do.call(pmax, c(.SD[, grep(paste0('^', pat, '$'), 
           names(.SD)), with =  FALSE], na.rm = TRUE)))]
#    id val1.x val2.x val1.y val2.y val1.max val2.max
# 1:  1      1      1      0      1        1        1
# 2:  2      0      0      0      0        0        0
# 3:  3      0      1      0      0        0        1
# 4:  4      1      1      1     NA        1        1
# 5:  5      0      0      0      0        0        0
# 6:  6      1      0     NA      1        1        1
# 7:  7      0      0     NA      0        0        0
# 8:  8      0      0      0      0        0        0
# 9:  9      1      0      1      1        1        1
#10: 10      0      0      0      0        0        0

Or with tidyverse, reshape into 'long' with pivot_longer, do a group by max for multiple columns in summarise_at and join with the original dataset

library(dplyr)
library(tidyr)
d %>%
   pivot_longer(cols = -id, names_sep="[.]", names_to = c(".value", "group")) %>% 
   group_by(id) %>%
   summarise_at(vars(starts_with('val')),
     list(max = ~max(., na.rm = TRUE))) %>% 
   left_join(d, .)
#   id val1.x val2.x val1.y val2.y val1_max val2_max
#1   1      1      1      0      1        1        1
#2   2      0      0      0      0        0        0
#3   3      0      1      0      0        0        1
#4   4      1      1      1     NA        1        1
#5   5      0      0      0      0        0        0
#6   6      1      0     NA      1        1        1
#7   7      0      0     NA      0        0        0
#8   8      0      0      0      0        0        0
#9   9      1      0      1      1        1        1
#10 10      0      0      0      0        0        0
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This is a reasonable workaround, but I am specifically looking for a solution in which dynamic column references are used. For instance, I know that I can make the right hand side of the calculation dynamic with something like `d[, .(val1.max := pmax(eval(as.symbol('val1.x')), eval(as.symbol('val2.x'))))]` but I cannot get the left hand side to be dynamic – Merik Dec 05 '19 at 19:12
  • Also, note that your data.table solution is hardcoding the number of columns it can process (in this case, only two columns, because you specified `value1, value2`). I am trying to avoid any hardcoding. – Merik Dec 05 '19 at 19:20
  • @Merik I also provided a `tidyverse` solutioin – akrun Dec 05 '19 at 19:30
  • @Merik You can check my updated data.table solution with `pmax` – akrun Dec 05 '19 at 19:44
  • The second solution is smart. Can you please edit your solution, remove the first solution (which is limited to two columns), and also explain the logic of `.SD[, grep(pat, names(.SD)), with = FALSE]` and why it needs a `grep`? – Merik Dec 06 '19 at 00:41
  • 1
    Also, I think the grep should be modified to `grep(paste0('^', pat, '$'), names(.SD))` so that if we happen to have other columns with partially matching names, they are not picked up – Merik Dec 06 '19 at 01:00