Filter rows with max value from a range of columns

Question

I have a df with strings and numbers. I need to filter the max value by group excluding the year (date type).

I did this:

test
type Process RegionName Time Level a b c d e f    g       h i j k  l m  n o    p q r s t u v  w
 a1   XYZ_1        ABC 2010 fixed 0 0 0 0 0 0     1     957 0 0 0  0 0  0 0    0 0 0 0 0 0 0  0
 a1   XYZ_2        ABC 2010 fixed 0 0 0 0 0 0     0      61 0 0 0  3 0  0 0    0 0 0 0 0 0 0  0
 a1   XYZ_3        ABC 2010 fixed 0 0 0 0 0 0     0     695 0 0 0  0 0  2 0    0 0 0 0 0 0 0  0
 a1   XYZ_4        ABC 2010 fixed 0 0 0 0 0 0     0     525 0 0 0  0 0  0 0    3 0 0 0 0 0 0  0
 b1   XYZ_5        ABC 2010 fixed 0 0 0 0 0 0 10551 1168053 0 0 0  0 0  0 0    0 0 0 0 0 0 0  0
 b1   XYZ_6        ABC 2010 fixed 0 0 0 0 0 0     0    7571 0 0 0 30 0  0 0    0 0 0 0 0 0 0  0
 b1   XYZ_7        ABC 2010 fixed 0 0 0 0 0 0     0   10883 0 0 0  0 0 51 0    0 0 0 0 0 0 0  0
 b1   XYZ_8        ABC 2010 fixed 0 0 0 0 0 0     0   40453 0 0 0  0 0  0 0  196 0 0 0 0 0 0  0
 b1   XYZ_9        ABC 2010 fixed 0 0 0 0 0 0     0   24464 0 0 0  0 0  0 0    0 0 0 0 0 0 0 53
 c1  XYZ_10        ABC 2010 fixed 0 0 0 0 0 0     0       0 0 0 0 16 0  0 0    0 0 0 0 0 0 0  0
 c1  XYZ_11        ABC 2010 fixed 0 0 0 0 0 0     0     129 0 0 0  0 0  0 0    0 0 0 0 0 0 0  0
 d1  XYZ_12        ABC 2010 fixed 0 0 0 0 0 0     2    1616 0 0 0  0 0  0 0    0 0 0 0 0 0 0  0
 d1  XYZ_13        ABC 2010 fixed 0 0 0 0 0 0     0     762 0 0 4  0 0  0 0    0 0 0 0 0 0 0  0
 d1  XYZ_14        ABC 2010 fixed 0 0 0 0 0 0     0    1002 0 0 0 12 0  0 0    0 0 0 0 0 0 0  0
 d1  XYZ_15        ABC 2010 fixed 0 0 0 0 0 0     0     556 0 0 0  0 0  7 0    0 0 0 0 0 0 0  0
 d1  XYZ_16        ABC 2010 fixed 0 0 0 0 0 0     0  961647 0 0 0  0 0  0 0 4661 0 0 0 0 0 0  0
 d1  XYZ_17        ABC 2010 fixed 0 0 0 0 0 0     0    1381 0 0 0  0 0  0 0    0 0 0 0 0 0 0  3

max_test <- test %>% 
  group_by(type) %>% 
  slice(which.max(a:w)) 
max_test
type  Process RegionName  Time Level a  b c d e f     g       h i j k
a1    XYZ_1   ABC         2010 fixed 0  0 0 0 0 0     1     957 0 0 0
b1    XYZ_5   ABC         2010 fixed 0  0 0 0 0 0 10551 1168053 0 0 0
c1    XYZ_10  ABC         2010 fixed 0  0 0 0 0 0     0       0 0 0 0
d1    XYZ_12  ABC         2010 fixed 0  0 0 0 0 0     2    1616 0 0 0

a1 and b1 are what I expect. However, c1 and d1 are not.

For c1, I expect XYZ_11 because 129 > 16

For d1, I expect XYZ_16 because 961647 > 1616

Any idea what I am doing wrong?

Note: I have not introduce in the code to avoid Time values. I only do which.max(a:w). So, a1 might be considering 2010 instead of 957 as the highest value?

Mark · Accepted Answer · 2023-08-21T06:25:29.723

4

You want this:

df %>% filter(if_any(a:w, ~.x == max(across(a:w))), .by = type)
      
# A tibble: 4 × 28
  type  Process RegionName  Time Level     a     b     c     d     e     f     g
  <chr> <chr>   <chr>      <int> <chr> <int> <int> <int> <int> <int> <int> <int>
1 a1    XYZ_1   ABC         2010 fixed     0     0     0     0     0     0     1
2 b1    XYZ_5   ABC         2010 fixed     0     0     0     0     0     0 10551
3 c1    XYZ_11  ABC         2010 fixed     0     0     0     0     0     0     0
4 d1    XYZ_16  ABC         2010 fixed     0     0     0     0     0     0     0
# ℹ 16 more variables: h <int>, i <int>, j <int>, k <int>, l <int>, m <int>,
#   n <int>, o <int>, p <int>, q <int>, r <int>, s <int>, t <int>, u <int>,
#   v <int>, w <int>

slice, according to the documentation:

lets you index rows by their (integer) locations.

Unsure why you trying to use that to filter. If you're unaware, you filter with the filter function. Have a read of the manual

edited Aug 21 '23 at 06:25

answered Jul 09 '23 at 08:16

Mark

7,785
2
14
34

1

I did try filter but not with your sophistication. Really appreciate it. I also try `colMax <- function(data) sapply(df, max, na.rm = TRUE)`, did not work either. – dmoyaec Jul 09 '23 at 09:29
what is the meaning of `~.x`? – dmoyaec Jul 09 '23 at 10:14
good question! In R, a `~` followed by some code is what is called a lambda function (also known as an anonymous function). It is so called because instead of creating a function, and then calling that function, you can create the function within your code and have it run as if it were a function (this can be very handy when you only want to use a piece of code once or twice) – Mark Jul 09 '23 at 14:22
`.x` in this context is the stand-in for the input to the function - it's the columns a:w, from the bit just before that. – Mark Jul 09 '23 at 14:24
so `max(across(a:w))` finds the max value across the columns a to w, and the `.x ==` part checks if any of the values in each row match that max value. The whole function returns either true or false, which `filter()` uses to filter the rows – Mark Jul 09 '23 at 14:27
I initially tried to use across within filter, but apparently that was deprecated a while ago, and they tell you to use `if_any` or `if_all` instead, but they work in a similar way – Mark Jul 09 '23 at 14:28
If you are still unsure about lambda functions (and you wouldn't be alone lol), here is a good answer I found: https://stackoverflow.com/a/67668150/4145280 – Mark Jul 09 '23 at 14:29
1

oh this is new stuff for me. Many thanks @Mark. – dmoyaec Jul 10 '23 at 04:18
no worries! glad I could help :-) – Mark Jul 10 '23 at 04:37
is there any chance you can check my updated question [here](https://stackoverflow.com/questions/76647324/combine-2-df-to-make-projections-using-previous-result-of-the-cross-join-and-gro)? I am not sure using cumsum, accumulate or lag funtions. – dmoyaec Jul 10 '23 at 05:22
just answered it :-) – Mark Jul 10 '23 at 08:08

score 3 · Answer 2 · answered Jul 09 '23 at 08:19

Using matrixStats::rowMaxs in by.

by(dat, dat$ype, \(x) {
  x[which.max(matrixStats::rowMaxs(as.matrix(x[-(1:5)]))), ]
}) |> do.call(what='rbind')
#    ype Process RegionName Time Level a b c d e f     g       h i j k l m n o    p q r s t u v w
# a1  a1   XYZ_1        ABC 2010 fixed 0 0 0 0 0 0     1     957 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0
# b1  b1   XYZ_5        ABC 2010 fixed 0 0 0 0 0 0 10551 1168053 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0
# c1  c1  XYZ_11        ABC 2010 fixed 0 0 0 0 0 0     0     129 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0
# d1  d1  XYZ_16        ABC 2010 fixed 0 0 0 0 0 0     0  961647 0 0 0 0 0 0 0 4661 0 0 0 0 0 0 0

Data:

dat <- structure(list(ype = c("a1", "a1", "a1", "a1", "b1", "b1", "b1", 
"b1", "b1", "c1", "c1", "d1", "d1", "d1", "d1", "d1", "d1"), 
    Process = c("XYZ_1", "XYZ_2", "XYZ_3", "XYZ_4", "XYZ_5", 
    "XYZ_6", "XYZ_7", "XYZ_8", "XYZ_9", "XYZ_10", "XYZ_11", "XYZ_12", 
    "XYZ_13", "XYZ_14", "XYZ_15", "XYZ_16", "XYZ_17"), RegionName = c("ABC", 
    "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", 
    "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC"), Time = c(2010L, 
    2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
    2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L), Level = c("fixed", 
    "fixed", "fixed", "fixed", "fixed", "fixed", "fixed", "fixed", 
    "fixed", "fixed", "fixed", "fixed", "fixed", "fixed", "fixed", 
    "fixed", "fixed"), a = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), b = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), c = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L), d = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L), e = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), f = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), g = c(1L, 
    0L, 0L, 0L, 10551L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 
    0L, 0L), h = c(957L, 61L, 695L, 525L, 1168053L, 7571L, 10883L, 
    40453L, 24464L, 0L, 129L, 1616L, 762L, 1002L, 556L, 961647L, 
    1381L), i = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L), j = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), k = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 0L, 0L, 0L, 0L), 
    l = c(0L, 3L, 0L, 0L, 0L, 30L, 0L, 0L, 0L, 16L, 0L, 0L, 0L, 
    12L, 0L, 0L, 0L), m = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), n = c(0L, 0L, 2L, 0L, 0L, 
    0L, 51L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 7L, 0L, 0L), o = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L), p = c(0L, 0L, 0L, 3L, 0L, 0L, 0L, 196L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 4661L, 0L), q = c(0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), r = c(0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    ), s = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L), t = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), u = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), v = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L), w = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 53L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 3L)), class = "data.frame", row.names = c(NA, 
-17L))

score 1 · Answer 3 · answered Jul 09 '23 at 11:13

A base R option by subset, with ave + pmax

subset(
    test,
    !!ave(
        do.call(pmax, test[-(1:5)]),
        type,
        FUN = \(x) which.max(x) == seq_along(x)
    )
)

which gives

   type Process RegionName Time Level a b c d e f     g       h i j k l m n o
1    a1   XYZ_1        ABC 2010 fixed 0 0 0 0 0 0     1     957 0 0 0 0 0 0 0
5    b1   XYZ_5        ABC 2010 fixed 0 0 0 0 0 0 10551 1168053 0 0 0 0 0 0 0
11   c1  XYZ_11        ABC 2010 fixed 0 0 0 0 0 0     0     129 0 0 0 0 0 0 0
16   d1  XYZ_16        ABC 2010 fixed 0 0 0 0 0 0     0  961647 0 0 0 0 0 0 0
      p q r s t u v w
1     0 0 0 0 0 0 0 0
5     0 0 0 0 0 0 0 0
11    0 0 0 0 0 0 0 0
16 4661 0 0 0 0 0 0 0

Filter rows with max value from a range of columns

3 Answers3