155

In a dataset with multiple observations for each subject. For each subject I want to select the row which have the maximum value of 'pt'. For example, with a following dataset:

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)
#   Subject pt Event
# 1       1  2     1
# 2       1  3     1
# 3       1  5     2 # max 'pt' for Subject 1
# 4       2  2     1
# 5       2  5     2
# 6       2  8     1
# 7       2 17     2 # max 'pt' for Subject 2
# 8       3  3     2
# 9       3  5     2 # max 'pt' for Subject 3

Subject 1, 2, and 3 have the biggest pt value of 5, 17, and 5 respectively.

How could I first find the biggest pt value for each subject, and then, put this observation in another data frame? The resulting data frame should only have the biggest pt values for each subject.

Henrik
  • 65,555
  • 14
  • 143
  • 159
Xinting WANG
  • 1,905
  • 2
  • 15
  • 20
  • 3
    This is very closely related but for minimum instead of maximum http://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group – David Arenburg Feb 13 '17 at 13:14
  • Related: [Subset by group with data.table](https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table) – Henrik Oct 18 '17 at 14:32

19 Answers19

138

Here's a data.table solution:

require(data.table) ## 1.9.2
group <- as.data.table(group)

If you want to keep all the entries corresponding to max values of pt within each group:

group[group[, .I[pt == max(pt)], by=Subject]$V1]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

If you'd like just the first max value of pt:

group[group[, .I[which.max(pt)], by=Subject]$V1]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • 4
    seeing as data.table has had a LOT of changes since 2014, is this still the fastest/best solution to this question? – Ben May 24 '16 at 21:03
  • 2
    @Ben, in this case, fastest answer is still this, yes. `.SD` optimisation for these cases is still on the list. Have an eye on [#735](https://github.com/Rdatatable/data.table/issues/735). – Arun May 24 '16 at 22:26
  • 8
    Hi, What is $V1 here? #noob – sriharsha KB Jun 01 '16 at 13:53
  • 1
    Accessing the auto-named column. Run it without it to understand better. – Arun Jun 01 '16 at 14:19
  • 1
    it works well. can you explain what ```.I``` does in data.table? – HappyCoding Jun 27 '17 at 03:26
  • 2
    @HappyCoding, have a look at `?\`.I\`` and see if the explanation and examples there help? – Arun Jun 29 '17 at 23:15
  • 1
    .I holds the row location of the max entry per group, which is put in a column named V1. Essentially, `group[, .I[which.max(pt)], by=Subject]$V1` is a vector of positions, and therefore can be used to subset. – desval May 23 '19 at 14:45
  • Why it isn't faster if I first use setkey? – skan Oct 11 '20 at 19:34
  • How can we specify to keep missing values in one elegant step? With the solution above, these are dropped and adding `na.rm=TRUE` results in errors ("argument not used" or "wrong number of dimensions" depending on where it is placed). – Magasinus Sep 01 '21 at 09:23
114

The most intuitive method is to use group_by and top_n function in dplyr

group %>% group_by(Subject) %>% top_n(1, pt)

The result you get is

Source: local data frame [3 x 3]
Groups: Subject [3]

  Subject    pt Event
    (dbl) (dbl) (dbl)
1       1     5     2
2       2    17     2
3       3     5     2
user438383
  • 5,716
  • 8
  • 28
  • 43
Xi Liang
  • 1,261
  • 1
  • 9
  • 8
  • 2
    dplyr is also useful when you want to access the smallest and largest value in a group because the values are available as an array. So you can first sort by pt descending and then use pt[1] or first(pt) to get the highest value: ```group %>% group_by(Subject) %>% arrange(desc(pt), .by_group = TRUE) %>% summarise(max_pt=first(pt), min_pt=last(pt), Event=first(Event))``` – cw' Jan 16 '19 at 09:39
  • 13
    This will include multiple rows if there are ties. Use `slice(which.max(pt))` to only include one row per group. – cakraww Jul 18 '19 at 15:14
53

A shorter solution using data.table:

setDT(group)[, .SD[which.max(pt)], by=Subject]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2
Arun
  • 116,683
  • 26
  • 284
  • 387
Mark Chamness
  • 531
  • 4
  • 2
31

Another option is slice

library(dplyr)
group %>%
     group_by(Subject) %>%
     slice(which.max(pt))
#    Subject    pt Event
#    <dbl> <dbl> <dbl>
#1       1     5     2
#2       2    17     2
#3       3     5     2

Using dplyr 1.1.0

slice_max(group, pt, by = 'Subject')
akrun
  • 874,273
  • 37
  • 540
  • 662
24

Since {dplyr} v1.0.0 (May 2020) there is the new slice_* syntax which supersedes top_n().

See also https://dplyr.tidyverse.org/reference/slice.html.

library(tidyverse)

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>% 
  group_by(Subject) %>% 
  slice_max(pt)
#> # A tibble: 3 x 3
#> # Groups:   Subject [3]
#>   Subject    pt Event
#>     <dbl> <dbl> <dbl>
#> 1       1     5     2
#> 2       2    17     2
#> 3       3     5     2

Created on 2020-08-18 by the reprex package (v0.3.0.9001)

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
pat-s
  • 5,992
  • 1
  • 32
  • 60
19

A dplyr solution:

library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>%
    group_by(Subject) %>%
    summarize(max.pt = max(pt))

This yields the following data frame:

  Subject max.pt
1       1      5
2       2     17
3       3      5
divibisan
  • 11,659
  • 11
  • 40
  • 58
tkmckenzie
  • 1,353
  • 1
  • 10
  • 19
  • 14
    I think the OP wants to keep the `Event` column in the subset in which case you could do: `df %>% group_by(Subject) %>% filter(pt == max(pt))` (includes ties if present) – talat Jul 03 '14 at 18:11
12
do.call(rbind, lapply(split(group,as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))

Using Base R

Kalees Waran
  • 659
  • 6
  • 13
10

One more base R solution:

merge(aggregate(pt ~ Subject, max, data = group), group)

  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2
s_baldur
  • 29,441
  • 4
  • 36
  • 69
8

I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about

isIDmax <- with(dd, ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x)))==1
group[isIDmax, ]

#   ID Value Event
# 3  1     5     2
# 7  2    17     2
# 9  3     5     2

Here we use ave to look at the "Value" column for each "ID". Then we determine which value is the maximal and then turn that into a logical vector we can use to subset the original data.frame.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • Thanks very much but I have another question here. Why use with function in this method since ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x))==1 works extremely fine? I am a little bit confused. – Xinting WANG Jul 04 '14 at 12:38
  • I used `with` because it's a bit odd to have the data available both inside and outside the `group` data.frame. If you read the data in with `read.table` or something, you would need to use `with` because those column names would not be available outside the data.frame. – MrFlick Jul 04 '14 at 15:56
6

Another base solution

group_sorted <- group[order(group$Subject, -group$pt),]
group_sorted[!duplicated(group_sorted$Subject),]

# Subject pt Event
#       1  5     2
#       2 17     2
#       3  5     2

Order the data frame by pt (descending) and then remove rows duplicated in Subject

Ape
  • 1,159
  • 6
  • 11
4

In base you can use ave to get max per group and compare this with pt and get a logical vector to subset the data.frame.

group[group$pt == ave(group$pt, group$Subject, FUN=max),]
#  Subject pt Event
#3       1  5     2
#7       2 17     2
#9       3  5     2

Or using in addition with.

group[with(group, pt == ave(pt, Subject, FUN=max)),]

Or compare it already in the function.

group[as.logical(ave(group$pt, group$Subject, FUN=function(x) x==max(x))),]
#group[ave(group$pt, group$Subject, FUN=function(x) x==max(x))==1,] #Variant
GKi
  • 37,245
  • 2
  • 26
  • 48
2

Here's another data.table solution, since which.max does not work on characters

library(data.table)
group <- data.table(Subject=ID, pt=Value, Event=Event)

group[, .SD[order(pt, decreasing = TRUE) == 1], by = Subject]
Kyoma G
  • 21
  • 3
1

Another data.table solution:

library(data.table)
setDT(group)[, head(.SD[order(-pt)], 1), by = .(Subject)]
1

by is a version of tapply for data frames:

res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])

It returns an object of class by so we convert it to data frame:

do.call(rbind, b)
  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2
Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89
0

Another data.table option:

library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]

Or another (less readable but slightly faster):

group[group[, rn := .I][order(Subject, -pt), {
    rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]

timing code:

library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)


microbenchmark::microbenchmark(times=3L,
    mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
    mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
    mtd2 = {a2 <- DT2[DT2[, rn := .I][
        order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
    ]]},
    mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE

timings:

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493     3
 mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022     3
 mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814     3
 mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851     3
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0

Using dplyr 1.0.2 there are now two ways to do this, one is long hand and the other is using the verb across():

      # create data
      ID    <- c(1,1,1,2,2,2,2,3,3)
      Value <- c(2,3,5,2,5,8,17,3,5)
      Event <- c(1,1,2,1,2,1,2,2,2)
      
      group <- data.frame(Subject=ID, pt=Value, Event=Event)

Long hand the verb is max() but note the na.rm = TRUE which is useful for examples where there are NAs as in the closed question: Merge rows in a dataframe where the rows are disjoint and contain NAs:

       group %>% 
        group_by(Subject) %>% 
        summarise(pt = max(pt, na.rm = TRUE),
                  Event = max(Event, na.rm = TRUE))

This is ok if there are only a few columns but if the table has many columns across() is useful. The examples for this verb are often with summarise(across(start_with... but in this example the columns don't start with the same characters. Either they could be changed or the positions listed:

    group %>% 
        group_by(Subject) %>% 
        summarise(across(1:ncol(group)-1, max, na.rm = TRUE, .names = "{.col}"))

Note for the verb across() 1 refers to the first column after the first actual column so using ncol(group) won't work as that is too many columns (makes it position 4 rather than 3).

Zoë Turner
  • 459
  • 5
  • 8
0

I sometimes fall back on a solution around the order function, as this then works for any type of column (ie. not just numeric). Basically, order the table by whatever you want to max or min, and then take the first or last of each group.

I also find this to be fairly easily readable for data.table beginners.

library(data.table)

setDT(group)[order(pt)][, last(.SD), by = Subject]

Note that replacing last with first achieves the equivalent of the minimum value.

Serenthia
  • 1,222
  • 4
  • 22
  • 40
0

With dplyr 1.1.0, you can use slice_max with by to perform inline grouped slicing:

library(dplyr)
group %>% 
  slice_max(pt, n = 1, by = Subject)

#  Subject pt Event
#1       1  5     2
#2       2 17     2
#3       3  5     2
Maël
  • 45,206
  • 3
  • 29
  • 67
-1

If you want the biggest pt value for a subject, you could simply use:

   pt_max = as.data.frame(aggregate(pt~Subject, group, max))
Mutyalama
  • 51
  • 1
  • 2