0

I have a data frame grouped by ID with multiple rows per ID and several variables a, b, c, etc.

Here is a toy example:

dt <- structure(list(ID = c(1, 1, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 
5, 5, 6, 6, 6, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10), a = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1), b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1), c = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), d = c(1, 1, 0, 0, 
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
0, 0, 0, 0, 1, 1), e = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1), f = c(1, 
1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 1, 1, 1, 0, 1, 1, 1, 1), g = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1), h = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1), i = c(1, 1, 1, 1, 
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 
0, 0, 0, 0, 1, 1)), row.names = c(NA, -31L), class = c("tbl_df", 
"tbl", "data.frame"))

For each ID, I want to extract the row that represents the maximum value for each variable (taking first or last instance of max value is not a concern). There are plenty of examples of how to do this when only considering one variable for example. However, I am having difficulty applying this across multiple variables.

Here was my attempt to solve the problem (using data.table and lapply):

library(data.table)
setDT(dt)
variables = colnames(dt[, 2:10])
dt_max = dt[, lapply(.SD, which.max), .SDcols = variables, by = "ID"]

Looking at what this produces, it appears that values are being summed rather than the maximum value being extracted for each ID:

    ID a b c d e f g h i
 1:  1 1 1 1 1 2 1 1 1 1
 2:  2 1 1 1 1 1 1 1 1 1
 3:  3 1 1 1 1 5 1 1 2 1
 4:  4 1 1 1 1 1 1 1 1 1
 5:  5 1 1 1 1 1 3 1 1 1
 6:  6 1 1 1 1 1 1 1 1 1
 7:  7 1 1 1 1 1 1 1 1 1
 8:  8 1 1 1 1 1 2 1 2 2
 9:  9 1 1 1 1 1 2 1 1 1
10: 10 1 1 1 1 1 1 1 1 1

This is my desired/expected output:

    ID a b c d e f g h i
 1:  1 1 1 1 1 1 1 1 1 1
 2:  2 1 1 1 0 0 1 1 0 1
 3:  3 1 1 1 0 1 1 1 1 1
 4:  4 1 1 1 0 0 1 1 0 0
 5:  5 1 1 1 1 1 1 1 0 0
 6:  6 1 1 1 1 1 1 1 0 1
 7:  7 1 1 1 1 1 0 1 0 0
 8:  8 1 1 1 1 0 1 1 1 1
 9:  9 1 1 1 0 1 1 1 0 0
10: 10 1 1 1 1 1 1 1 1 1

I am at a loss for why this would be happening. My only other thought would be to do this the long way for each variable separately, then merging the results together. But this seems like a very inefficient way to solve the problem.

Any help would be greatly appreciated!

jmogil
  • 143
  • 8
  • I'm curious about "extract the row that represents the maximum value for each variable". Based on the link you provided, does your result need to be list of then matrix? – Park Sep 16 '21 at 06:55
  • 3
    What is your expected output? Do you need `dt[, lapply(.SD, max), .SDcols = variables, by = "ID"]` ? – Ronak Shah Sep 16 '21 at 06:56
  • 1
    For a tidyverse solution, look at `slice_max()`. As you will potentially have a different row selected for the max of each variable, you need to specify the format of your desired output. There are many possibilities... – Limey Sep 16 '21 at 06:59
  • Simpler: `dt1[, lapply(.SD, which.max), by = "ID"]`. The argument `.SDcols` is not needed. And I'm not seeing anything wrong with the output. – Rui Barradas Sep 16 '21 at 07:05
  • @RonakShah and Limey: I have edited the question with my desired output. Also, Ronak's solution works. Why would `max` work, but not `which.max`? – jmogil Sep 16 '21 at 07:08

1 Answers1

1

max and which.max are two different functions doing different things. max would give the max value in a vector whereas which.max would give position of the max value in the vector.

x <- 4:1

max(x)
#[1] 4
which.max(x)
#[1] 1

Here which.max returns 1 because 4 is present at the 1st position in the vector x.

So if you need max values in multiple columns, you should use max and not which.max.

library(data.table)
setDT(dt)
variables = colnames(dt[, 2:10])

dt[, lapply(.SD, max), .SDcols = variables, ID]

#    ID a b c d e f g h i
# 1:  1 1 1 1 1 1 1 1 1 1
# 2:  2 1 1 1 0 0 1 1 0 1
# 3:  3 1 1 1 0 1 1 1 1 1
# 4:  4 1 1 1 0 0 1 1 0 0
# 5:  5 1 1 1 1 1 1 1 0 0
# 6:  6 1 1 1 1 1 1 1 0 1
# 7:  7 1 1 1 1 1 0 1 0 0
# 8:  8 1 1 1 1 0 1 1 1 1
# 9:  9 1 1 1 0 1 1 1 0 0
#10: 10 1 1 1 1 1 1 1 1 1
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213