Extract the maximum value within each group in a dataframe

Question

I have a data frame with a grouping variable ("Gene") and a value variable ("Value"):

Gene   Value
A      12
A      10
B      3
B      5
B      6
C      1
D      3
D      4

For each level of my grouping variable, I wish to extract the maximum value. The result should thus be a data frame with one row per level of the grouping variable:

Gene   Value
A      12
B      6
C      1
D      4

Could aggregate do the trick?

Yes. `aggregate` could do the trick. Just use `max` as the relevant function. — A5C1D2H2I1M1N2O1R2T1, Aug 14 '14 at 17:49

score 197 · Accepted Answer · edited Aug 14 '14 at 19:02

197

There are many possibilities to do this in R. Here are some of them:

df <- read.table(header = TRUE, text = 'Gene   Value
A      12
A      10
B      3
B      5
B      6
C      1
D      3
D      4')

# aggregate
aggregate(df$Value, by = list(df$Gene), max)
aggregate(Value ~ Gene, data = df, max)

# tapply
tapply(df$Value, df$Gene, max)

# split + lapply
lapply(split(df, df$Gene), function(y) max(y$Value))

# plyr
require(plyr)
ddply(df, .(Gene), summarise, Value = max(Value))

# dplyr
require(dplyr)
df %>% group_by(Gene) %>% summarise(Value = max(Value))

# data.table
require(data.table)
dt <- data.table(df)
dt[ , max(Value), by = Gene]

# doBy
require(doBy)
summaryBy(Value~Gene, data = df, FUN = max)

# sqldf
require(sqldf)
sqldf("select Gene, max(Value) as Value from df group by Gene", drv = 'SQLite')

# ave
df[as.logical(ave(df$Value, df$Gene, FUN = function(x) x == max(x))),]

edited Aug 14 '14 at 19:02

talat

68,970
21
126
157

answered Aug 14 '14 at 17:58

EDi

13,160
2
48
57

5

You could add dplyr to the collection (+1) – talat Aug 14 '14 at 18:00
And, the formula method for `aggregate` gives much nicer output... – A5C1D2H2I1M1N2O1R2T1 Aug 14 '14 at 18:04
this shld be an SO R wiki Q/A – hrbrmstr Aug 14 '14 at 18:08
@beginneR: I haven't used dplyr yet (i'm more a data.table-guy), but it was easy to find out. – EDi Aug 14 '14 at 18:19
1

@AnandaMahto: Added also aggregate.formula... – EDi Aug 14 '14 at 18:19
@beginneR. Thanks for the edit. What's the difference between `%.%` and `%>%`? – EDi Aug 14 '14 at 19:01
`%.%` was used in the first versions of `dplyr` but later replace by `%>%` (originally from `magrittr`). I added an approach with `ave` - maybe not the best choice but also a possibility, since you started collecting different approaches. If this post shall be used as a future reference / community wiki, it might also make sense to add a note on the behavior of each method in case of multiple maximas per group (e.g. the `ave` method would return multiple maximas per group, but e.g. dplyr wouldn't) – talat Aug 14 '14 at 19:03
(..Or maybe just add this info to the ave version - I think it's the only one that would return multiple rows per group in case of multiple maximas) – talat Aug 14 '14 at 19:16
One of the simplest and most thorough answers I have seen on SO. Great job! Thanks :). – theforestecologist May 04 '17 at 19:16
I get the following error using aggregate(df$Value, by = list(df$Gene), max) : 'max' not meaningful for factors" – Peter Jun 03 '19 at 15:38
Have you check the class of Value (as the error message suggests?) – EDi Jun 05 '19 at 04:46

score 10 · Answer 2 · answered Jul 24 '17 at 14:25

10

df$Gene <- as.factor(df$Gene)
do.call(rbind, lapply(split(df,df$Gene), function(x) {return(x[which.max(x$Value),])}))

Just using base R

answered Jul 24 '17 at 14:25

Kalees Waran

659
6
13

Ajay Ohri · Answer 3 · 2017-06-22T07:28:50.373

1

Using sqldf and standard sql to get the maximum values grouped by another variable

https://cran.r-project.org/web/packages/sqldf/sqldf.pdf

library(sqldf)
sqldf("select max(Value),Gene from df1 group by Gene")

or

Using the excellent Hmisc package for a groupby application of function (max) https://www.rdocumentation.org/packages/Hmisc/versions/4.0-3/topics/summarize

library(Hmisc)
summarize(df1$Value,df1$Gene,max)

edited Jun 22 '17 at 07:28

answered Jun 22 '17 at 06:26

Ajay Ohri

3,382
3
30
60

Would be possible to get the max and the min values (or the difference between them) in just one command line? – godines Jun 28 '18 at 12:44
hmm looks like I can use: `aggregate(na.omit(Value)~Gene, df, function(x) c(max(x)-min(x)))` – godines Jun 28 '18 at 12:52

Extract the maximum value within each group in a dataframe

3 Answers3

Linked

Related