5

I have a data.frame with two columns: year and score. The years go from 2000-2012 and each year can be listed multiple times. In the score column I list all the scores for each year with each row having a different score.

What I'd like to do is filter the data.frame so only the rows with the maximum scores for each year remain.

So as a tiny example if I have

year score
2000    18
2001    22
2000    21

I would want to return just

year score
2001    22
2000    21
Jaap
  • 81,064
  • 34
  • 182
  • 193
user1956609
  • 2,132
  • 5
  • 27
  • 43

5 Answers5

8

using plyr

require(plyr)
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25))
ddply(df, .(year), summarise, max.score=max(score))

using data.table

require(data.table)
dt <- data.table(df, key="year")
dt[, list(max.score=max(score)), by=year]

using aggregate:

o <- aggregate(df$score, list(df$year) , max)
names(o) <- c("year", "max.score")

using ave:

df1 <- df
df1$max.score <- ave(df1$score, df1$year, FUN=max)
df1 <- df1[!duplicated(df1$year), ]

Edit: In case of more columns, a data.table solution would be the best (my opinion :))

set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25), 
               alpha = sample(letters[1:5], 25, replace=T), beta=rnorm(25))

# convert to data.table with key=year
dt <- data.table(df, key="year")
# get the subset of data that matches this criterion
dt[, .SD[score %in% max(score)], by=year]

#     year score alpha       beta
#  1: 2000    20     b  0.8675148
#  2: 2001    21     e  1.5543102
#  3: 2002    22     c  0.6676305
#  4: 2003    18     a -0.9953758
#  5: 2004    23     d  2.1829996
#  6: 2005    25     b -0.9454914
#  7: 2007    17     e  0.7158021
#  8: 2008    12     e  0.6501763
#  9: 2011    24     a  0.7201334
# 10: 2012    19     d  1.2493954
Arun
  • 116,683
  • 26
  • 284
  • 387
  • What if I other columns of data that I want to retain in the filtered result? For example, let's say I have a third column called study_hours, which is the number of hours I studied to get that particular score. I'd want to keep the entire rows that contain the max scores. – user1956609 Feb 14 '13 at 06:56
  • @user1956609 I have amended my example to show how to get other columns – CHP Feb 14 '13 at 08:43
  • @user1956609, in this case, I suggest that you use the `data.table` solution as edited. – Arun Feb 14 '13 at 11:26
8

If you know sql this is easier to understand

library(sqldf)
sqldf('select year, max(score) from mydata group by year')

Update (2016-01): Now you can also use dplyr

library(dplyr)
mydata %>% group_by(year) %>% summarise(max = max(score))
Feng Mai
  • 2,749
  • 1
  • 28
  • 33
7

using base packages

> df
  year score
1 2000    18
2 2001    22
3 2000    21
> aggregate(score ~ year, data=df, max)
  year score
1 2000    21
2 2001    22

EDIT

If you have additional columns that you need to keep, then you can user merge with aggregate to get those columns

> df <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21) , hrs = c( 10, 11, 12))
> df
  year score hrs
1 2000    18  10
2 2001    22  11
3 2000    21  12
> merge(aggregate(score ~ year, data=df, max), df, all.x=T)
  year score hrs
1 2000    21  12
2 2001    22  11
CHP
  • 16,981
  • 4
  • 38
  • 57
1
data <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21))
new.year <- unique(data$year)
new.score <- sapply(new.year, function(y) max(data[data$year == y, ]$score))
data <- data.frame(year = new.year, score = new.score)
AndreyAkinshin
  • 18,603
  • 29
  • 96
  • 155
1

one liner,

df_2<-data.frame(year=sort(unique(df$year)),score = tapply(df$score,df$year,max));
Aditya Sihag
  • 5,057
  • 4
  • 32
  • 43