162

I would like to select the row with the maximum value in each group using dplyr.

First, I generate some random data to illustrate my question:

set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row.

library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value),])

In dplyr, I am using this code, but it returns only the maximum value itself, not the full row that contains it (column C in this case):

library(dplyr)
df %>% group_by(A, B) %>%
    summarise(max = max(value))

How could I achieve this? Thanks for any suggestion.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1        
[4] tools_3.1.0        
– Bangyou

6 Answers

184

Try this:

result <- df %>% 
             group_by(A, B) %>%
             filter(value == max(value)) %>%
             arrange(A,B,C)

Seems to work:

identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value),])
)
#[1] TRUE

As pointed out in the comments, slice may be preferred here, as per @RoyalTS's answer below, if you strictly want only one row per group. This answer will return multiple rows if several rows share the same maximum value.
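For illustration, a minimal sketch on hypothetical tied data (not from the question) showing the difference:

library(dplyr)
tied <- data.frame(A = c(1, 1, 2), B = 1, value = c(5, 5, 3))

# filter() keeps every row equal to the group maximum, so the tie in group A = 1 yields two rows
tied %>% group_by(A, B) %>% filter(value == max(value))

# slice(which.max(...)) keeps only the first of the tied rows, matching the ddply/which.max behaviour
tied %>% group_by(A, B) %>% slice(which.max(value))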

– thelatemail
  • The results are identical in this case because there are no duplicated maximum values present. Otherwise, the `filter` approach would return all maximum values (rows) per group while the OP's ddply approach with `which.max` would only return one maximum (the first) per group. To replicate that behavior, another option is to use `slice(which.max(value))` in dplyr. – talat Jun 25 '15 at 07:23
  • According to @talat's comment, the same result may be obtained using `filter(row_number() == 1)`. Remember to arrange the data frame before grouping. – Tiziano Nov 21 '20 at 17:50
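To illustrate that comment as a runnable sketch (arranging before grouping so that the first row of each group is its maximum):

library(dplyr)
df %>%
    arrange(A, B, desc(value)) %>%   # sort so the largest value comes first within each A/B combination
    group_by(A, B) %>%
    filter(row_number() == 1)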
106
df %>% group_by(A,B) %>% slice(which.max(value))
– RoyalTS
  • `slice_max(.data, order_by, ..., n, prop, with_ties = TRUE)` is now also a good option... – Ben May 11 '21 at 17:50
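A minimal sketch of that approach, assuming dplyr >= 1.0.0 (where `slice_max()` was introduced):

library(dplyr)
df %>%
    group_by(A, B) %>%
    slice_max(value, n = 1, with_ties = FALSE)  # with_ties = FALSE keeps exactly one row per group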
86

You can use top_n

df %>% group_by(A, B) %>% top_n(n=1)

This will rank by the last column (value) and return the top n=1 rows.

Currently, you can't change this default without causing an error (see https://github.com/hadley/dplyr/issues/426)
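In later dplyr versions the ranking column can be passed explicitly via `wt`; a small sketch, assuming a dplyr release after that issue was fixed:

library(dplyr)
df %>% group_by(A, B) %>% top_n(n = 1, wt = value)  # note: ties, if any, are all kept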

– mnel
18

This more verbose solution provides greater control over what happens in the case of duplicate maximum values (in this example, it takes one of the tied rows at random):

library(dplyr)
df %>% group_by(A, B) %>%
  mutate(the_rank  = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>% select(-the_rank)
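As a usage note, `ties.method` can be swapped for deterministic behaviour; a small variant, assuming you want the earliest tied row in the original order:

library(dplyr)
df %>% group_by(A, B) %>%
  mutate(the_rank = rank(-value, ties.method = "first")) %>%  # "first" always keeps the earliest tied row
  filter(the_rank == 1) %>% select(-the_rank)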
– nassimhddd
2

More generally, I think you might want to get the "top" row(s) after sorting within a given group.

In the case of a single maximum value, you are essentially sorting by only one column. However, it is often useful to sort hierarchically by multiple columns (for example, a date column and a time-of-day column).

# Answering the question of getting row with max "value".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in descending order by "value" column.
  arrange( desc(value) ) %>% 
  # Pick the top 1 value
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()

# Answering an extension of the question of 
# getting row with the max value of the lowest "C".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in ascending order by C, and then within that by 
  # descending order by "value" column.
  arrange( C, desc(value) ) %>% 
  # Pick the one top row based on the sort
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()
– Kalin
  • When you `arrange( desc(value) )`, are the values grouped? I would like to have there, say, the sum of C. – Przemyslaw Remin Apr 09 '21 at 15:00
  • @PrzemyslawRemin, I'm not sure I fully understand where the "sum of C" is or how you mean to use it. In `dplyr` it's nice to separate your steps. You could `df %>% group_by( A, B) %>% mutate( s = sum(C) )` which will put the sum of C within each group as a (repeated) value `s` within each group (each row gets the same sum value within a group). And then you could `ungroup` and use one of the other methods described to filter max of C, such as `... %>% ungroup() %>% slice(which.max(C))`. – Kalin Apr 20 '21 at 15:31
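Putting that comment together as a runnable sketch (the column name `s` is just illustrative):

library(dplyr)
df %>%
    group_by(A, B) %>%
    mutate(s = sum(C)) %>%   # per-group sum of C, repeated on every row of the group
    ungroup() %>%
    slice(which.max(C))      # after ungrouping, keeps the single row with the overall largest C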
0

For me, it helped to count the number of values per group. Copy the count table into a new object. Then filter for the max of the group based on the first grouping characteristic. For example:

count_table  <- df %>%
                group_by(A, B) %>%
                count() %>%
                arrange(A, desc(n))

count_table %>% 
    group_by(A) %>%
    filter(n == max(n))

or

count_table %>% 
    group_by(A) %>%
    top_n(1, n)
– ksvrd
  • If you run this on the data in the question, I think you'll see it's answering the wrong question. The question in the post is about how to find the rows that have the maximum value (the number in the `value` column). This answer ignores the `value` column to find the most common `B` value for each `A`. – Gregor Thomas Feb 01 '19 at 14:49