Subsetting by group condition in R

Question

I have a data set that looks somewhat like this

id     year
 1     2012
 1     2014
 1     2015
 2     2014
 2     2018
 2     2019
 3     2011
 4     2010

I only want to keep based on these two conditions (1) one observation per id, (2) the latest year of that id. So, for example, for id 1, I only want to have the 2015 row, for id 2, I only want 2019 row, and for id 3 and 4, I only have one observation during those two years so just keep them both.

I have tried a few different things like:

df1<-subset(df, interaction(df$id, max(df$year)))

I know this subset doesn't make sense but I was just making things up, hoping that something would make sense in my head. Another one I tried,

lapply(unique(df$id), function(max) subset(df, mac(year)))

but I keep getting errors.

Any help would be greatly appreciated! Thank you in advance.

score 1 · Answer 1 · answered Nov 27 '19 at 22:37

You should group and then filter

df %>%
  group_by(id) %>%
  filter(year == max(year)) %>%
  ungroup()

Differently from the solution with slice proposed by @akrun, this will return all the rows where the year is the max for each id. it depends on your needs.

score 0 · Accepted Answer · answered Nov 27 '19 at 22:12

We can slice the rows where the 'year' is max after grouping by 'id'

library(dplyr)
df %>% 
    group_by(id) %>%
    slice(which.max(year))
# A tibble: 4 x 2
# Groups:   id [4]
#     id  year
#  <int> <int>
#1     1  2015
#2     2  2019
#3     3  2011
#4     4  2010

If these are the only columns, after grouping by 'id', apply max on 'year' to summarise

df %>%
     group_by(id) %>%
     summarise(year = max(year))

Or with base R

aggregate(year ~ id, df, FUN = max)

data

df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L), year = c(2012L, 
2014L, 2015L, 2014L, 2018L, 2019L, 2011L, 2010L)),
class = "data.frame", row.names = c(NA, 
-8L))

These are not the only columns but the first code definitely works! Thank you so much! — Lee12345, Nov 27 '19 at 22:17

G. Grothendieck · Answer 3 · 2019-11-27T22:55:40.990

1) subset/duplicated Using df shown reproducibly in the Note at the end and assuming that it is sorted by year within id (which is the case for the data shown in the question) use subset with duplicated. No packages are used.

subset(df, !duplicated(id, fromLast = TRUE))
##   id year
## 3  1 2015
## 6  2 2019
## 7  3 2011
## 8  4 2010

2) subset/ave Another way to use subset is to use ave. This one does not depend on the input being sorted.

subset(df, ave(year, id, FUN = max) == year)
##   id year
## 3  1 2015
## 6  2 2019
## 7  3 2011
## 8  4 2010

3) by bAnother base approach is to use by. It returns a list of data frames which we rbind together.

do.call("rbind", by(df, df$id, function(x) x[which.max(x$year), ]))
##   id year
## 1  1 2015
## 2  2 2019
## 3  3 2011
## 4  4 2010

4) tapply We can use tapply if it is ok to return a vector of years whose names are the id's. Again, this uses only base R.

with(df, tapply(year,  id, max))
##    1    2    3    4 
## 2015 2019 2011 2010

I had also suggested using aggregate but another responder also provided that answer.

Note

Lines <- "id     year
 1     2012
 1     2014
 1     2015
 2     2014
 2     2018
 2     2019
 3     2011
 4     2010"
df <- read.table(text = Lines, header = TRUE)

Subsetting by group condition in R

3 Answers3

data

Note