Select the row with the maximum value in each group

Question

In a dataset with multiple observations for each subject. For each subject I want to select the row which have the maximum value of 'pt'. For example, with a following dataset:

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)
#   Subject pt Event
# 1       1  2     1
# 2       1  3     1
# 3       1  5     2 # max 'pt' for Subject 1
# 4       2  2     1
# 5       2  5     2
# 6       2  8     1
# 7       2 17     2 # max 'pt' for Subject 2
# 8       3  3     2
# 9       3  5     2 # max 'pt' for Subject 3

Subject 1, 2, and 3 have the biggest pt value of 5, 17, and 5 respectively.

How could I first find the biggest pt value for each subject, and then, put this observation in another data frame? The resulting data frame should only have the biggest pt values for each subject.

This is very closely related but for minimum instead of maximum http://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group — David Arenburg, Feb 13 '17 at 13:14
Related: [Subset by group with data.table](https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table) — Henrik, Oct 18 '17 at 14:32

score 138 · Accepted Answer · answered Jul 03 '14 at 16:11

138

Here's a data.table solution:

require(data.table) ## 1.9.2
group <- as.data.table(group)

If you want to keep all the entries corresponding to max values of pt within each group:

group[group[, .I[pt == max(pt)], by=Subject]$V1]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

If you'd like just the first max value of pt:

group[group[, .I[which.max(pt)], by=Subject]$V1]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.

answered Jul 03 '14 at 16:11

Arun

116,683
26
284
387

4

seeing as data.table has had a LOT of changes since 2014, is this still the fastest/best solution to this question? – Ben May 24 '16 at 21:03
2

@Ben, in this case, fastest answer is still this, yes. `.SD` optimisation for these cases is still on the list. Have an eye on [#735](https://github.com/Rdatatable/data.table/issues/735). – Arun May 24 '16 at 22:26
8

Hi, What is $V1 here? #noob – sriharsha KB Jun 01 '16 at 13:53
1

Accessing the auto-named column. Run it without it to understand better. – Arun Jun 01 '16 at 14:19
1

it works well. can you explain what ```.I``` does in data.table? – HappyCoding Jun 27 '17 at 03:26
2

@HappyCoding, have a look at `?\`.I\`` and see if the explanation and examples there help? – Arun Jun 29 '17 at 23:15
1

.I holds the row location of the max entry per group, which is put in a column named V1. Essentially, `group[, .I[which.max(pt)], by=Subject]$V1` is a vector of positions, and therefore can be used to subset. – desval May 23 '19 at 14:45
Why it isn't faster if I first use setkey? – skan Oct 11 '20 at 19:34
How can we specify to keep missing values in one elegant step? With the solution above, these are dropped and adding `na.rm=TRUE` results in errors ("argument not used" or "wrong number of dimensions" depending on where it is placed). – Magasinus Sep 01 '21 at 09:23

score 114 · Answer 2 · edited Sep 02 '22 at 13:42

114

The most intuitive method is to use group_by and top_n function in dplyr

group %>% group_by(Subject) %>% top_n(1, pt)

The result you get is

Source: local data frame [3 x 3]
Groups: Subject [3]

  Subject    pt Event
    (dbl) (dbl) (dbl)
1       1     5     2
2       2    17     2
3       3     5     2

edited Sep 02 '22 at 13:42

user438383

5,716
8
28
43

answered Feb 17 '16 at 23:06

Xi Liang

1,261
1
9
8

2

dplyr is also useful when you want to access the smallest and largest value in a group because the values are available as an array. So you can first sort by pt descending and then use pt[1] or first(pt) to get the highest value: ```group %>% group_by(Subject) %>% arrange(desc(pt), .by_group = TRUE) %>% summarise(max_pt=first(pt), min_pt=last(pt), Event=first(Event))``` – cw' Jan 16 '19 at 09:39
13

This will include multiple rows if there are ties. Use `slice(which.max(pt))` to only include one row per group. – cakraww Jul 18 '19 at 15:14

score 53 · Answer 3 · edited Apr 07 '15 at 20:33

53

A shorter solution using data.table:

setDT(group)[, .SD[which.max(pt)], by=Subject]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

edited Apr 07 '15 at 20:33

Arun

116,683
26
284
387

answered Apr 07 '15 at 17:12

Mark Chamness

531
4
2

8

Note that, this can be slower than `group[group[, .I[which.max(pt)], by=Subject]$V1]` as proposed above by @Arun; see comparisons [here](https://stackoverflow.com/a/41838383/5193830) – Valentin_Ștefan Jan 21 '19 at 16:56
1

I like this one because it's fast enough for my current context and easier to grok for me vs the `.I` version – arvi1000 Jul 19 '19 at 22:07
setDT(group)[, .SD[ pt== max(pt) ] , by=Subject] – Ferroao Feb 20 '20 at 22:03
How do you exclude cases where you have two max values? – Herman Toothrot Jul 20 '22 at 16:59
Sorry to necro, but for future users: `setDT(group)[, .SD[which.max(pt)][1], by=Subject]` will take the first entry when there are multiple. – Daniel V May 11 '23 at 04:09

akrun · Answer 4 · 2023-02-18T15:00:37.717

31

Another option is slice

library(dplyr)
group %>%
     group_by(Subject) %>%
     slice(which.max(pt))
#    Subject    pt Event
#    <dbl> <dbl> <dbl>
#1       1     5     2
#2       2    17     2
#3       3     5     2

Using dplyr 1.1.0

slice_max(group, pt, by = 'Subject')

edited Feb 18 '23 at 15:00

answered Feb 12 '17 at 14:28

akrun

874,273
37
540
662

score 24 · Answer 5 · edited Apr 20 '23 at 17:53

Since {dplyr} v1.0.0 (May 2020) there is the new slice_* syntax which supersedes top_n().

See also https://dplyr.tidyverse.org/reference/slice.html.

library(tidyverse)

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>% 
  group_by(Subject) %>% 
  slice_max(pt)
#> # A tibble: 3 x 3
#> # Groups:   Subject [3]
#>   Subject    pt Event
#>     <dbl> <dbl> <dbl>
#> 1       1     5     2
#> 2       2    17     2
#> 3       3     5     2

^{Created on 2020-08-18 by the reprex package (v0.3.0.9001)}

score 19 · Answer 6 · edited Sep 06 '18 at 20:19

19

A dplyr solution:

library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>%
    group_by(Subject) %>%
    summarize(max.pt = max(pt))

This yields the following data frame:

  Subject max.pt
1       1      5
2       2     17
3       3      5

edited Sep 06 '18 at 20:19

divibisan

11,659
11
40
58

answered Jul 03 '14 at 16:06

tkmckenzie

1,353
1
10
19

14

I think the OP wants to keep the `Event` column in the subset in which case you could do: `df %>% group_by(Subject) %>% filter(pt == max(pt))` (includes ties if present) – talat Jul 03 '14 at 18:11

score 12 · Answer 7 · answered Jul 25 '17 at 06:10

12

do.call(rbind, lapply(split(group,as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))

Using Base R

answered Jul 25 '17 at 06:10

Kalees Waran

659
6
13

score 10 · Answer 8 · answered Dec 29 '19 at 12:15

10

One more base R solution:

merge(aggregate(pt ~ Subject, max, data = group), group)

  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

answered Dec 29 '19 at 12:15

s_baldur

29,441
4
36
69

score 8 · Answer 9 · answered Jul 03 '14 at 16:02

8

I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about

isIDmax <- with(dd, ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x)))==1
group[isIDmax, ]

#   ID Value Event
# 3  1     5     2
# 7  2    17     2
# 9  3     5     2

Here we use ave to look at the "Value" column for each "ID". Then we determine which value is the maximal and then turn that into a logical vector we can use to subset the original data.frame.

answered Jul 03 '14 at 16:02

MrFlick

195,160
17
277
295

Thanks very much but I have another question here. Why use with function in this method since ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x))==1 works extremely fine? I am a little bit confused. – Xinting WANG Jul 04 '14 at 12:38
I used `with` because it's a bit odd to have the data available both inside and outside the `group` data.frame. If you read the data in with `read.table` or something, you would need to use `with` because those column names would not be available outside the data.frame. – MrFlick Jul 04 '14 at 15:56

Ape · Answer 10 · 2020-03-27T10:27:14.947

6

Another base solution

group_sorted <- group[order(group$Subject, -group$pt),]
group_sorted[!duplicated(group_sorted$Subject),]

# Subject pt Event
#       1  5     2
#       2 17     2
#       3  5     2

Order the data frame by pt (descending) and then remove rows duplicated in Subject

edited Mar 27 '20 at 10:27

answered Jul 29 '19 at 15:26

Ape

1,159
6
11

GKi · Answer 11 · 2023-04-24T07:34:30.823

In base you can use ave to get max per group and compare this with pt and get a logical vector to subset the data.frame.

group[group$pt == ave(group$pt, group$Subject, FUN=max),]
#  Subject pt Event
#3       1  5     2
#7       2 17     2
#9       3  5     2

Or using in addition with.

group[with(group, pt == ave(pt, Subject, FUN=max)),]

Or compare it already in the function.

group[as.logical(ave(group$pt, group$Subject, FUN=function(x) x==max(x))),]
#group[ave(group$pt, group$Subject, FUN=function(x) x==max(x))==1,] #Variant

Kyoma G · Answer 12 · 2019-04-18T08:43:58.117

2

Here's another data.table solution, since which.max does not work on characters

library(data.table)
group <- data.table(Subject=ID, pt=Value, Event=Event)

group[, .SD[order(pt, decreasing = TRUE) == 1], by = Subject]

edited Apr 18 '19 at 08:43

answered Apr 18 '19 at 08:38

Kyoma G

21
3

score 1 · Answer 13 · answered Apr 27 '20 at 11:09

1

Another data.table solution:

library(data.table)
setDT(group)[, head(.SD[order(-pt)], 1), by = .(Subject)]

answered Apr 27 '20 at 11:09

Vykta Wakandigara

339
2
10

score 1 · Answer 14 · answered Apr 28 '20 at 11:54

by is a version of tapply for data frames:

res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])

It returns an object of class by so we convert it to data frame:

do.call(rbind, b)
  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

chinsoon12 · Answer 15 · 2020-04-10T23:02:08.660

Another data.table option:

library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]

Or another (less readable but slightly faster):

group[group[, rn := .I][order(Subject, -pt), {
    rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]

timing code:

library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)


microbenchmark::microbenchmark(times=3L,
    mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
    mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
    mtd2 = {a2 <- DT2[DT2[, rn := .I][
        order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
    ]]},
    mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE

timings:

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493     3
 mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022     3
 mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814     3
 mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851     3

score 0 · Answer 16 · answered Dec 06 '20 at 21:11

Using dplyr 1.0.2 there are now two ways to do this, one is long hand and the other is using the verb across():

      # create data
      ID    <- c(1,1,1,2,2,2,2,3,3)
      Value <- c(2,3,5,2,5,8,17,3,5)
      Event <- c(1,1,2,1,2,1,2,2,2)
      
      group <- data.frame(Subject=ID, pt=Value, Event=Event)

Long hand the verb is max() but note the na.rm = TRUE which is useful for examples where there are NAs as in the closed question: Merge rows in a dataframe where the rows are disjoint and contain NAs:

       group %>% 
        group_by(Subject) %>% 
        summarise(pt = max(pt, na.rm = TRUE),
                  Event = max(Event, na.rm = TRUE))

This is ok if there are only a few columns but if the table has many columns across() is useful. The examples for this verb are often with summarise(across(start_with... but in this example the columns don't start with the same characters. Either they could be changed or the positions listed:

    group %>% 
        group_by(Subject) %>% 
        summarise(across(1:ncol(group)-1, max, na.rm = TRUE, .names = "{.col}"))

Note for the verb across() 1 refers to the first column after the first actual column so using ncol(group) won't work as that is too many columns (makes it position 4 rather than 3).

score 0 · Answer 17 · answered Jan 05 '23 at 14:26

I sometimes fall back on a solution around the order function, as this then works for any type of column (ie. not just numeric). Basically, order the table by whatever you want to max or min, and then take the first or last of each group.

I also find this to be fairly easily readable for data.table beginners.

library(data.table)

setDT(group)[order(pt)][, last(.SD), by = Subject]

Note that replacing last with first achieves the equivalent of the minimum value.

score 0 · Answer 18 · answered Feb 02 '23 at 15:39

0

With dplyr 1.1.0, you can use slice_max with by to perform inline grouped slicing:

library(dplyr)
group %>% 
  slice_max(pt, n = 1, by = Subject)

#  Subject pt Event
#1       1  5     2
#2       2 17     2
#3       3  5     2

answered Feb 02 '23 at 15:39

Maël

45,206
3
29
67

score -1 · Answer 19 · answered Jul 09 '18 at 06:02

-1

If you want the biggest pt value for a subject, you could simply use:

   pt_max = as.data.frame(aggregate(pt~Subject, group, max))

answered Jul 09 '18 at 06:02

Mutyalama

51
1
2

Select the row with the maximum value in each group

19 Answers19

Linked

Related