Keep only first five rows if group has more than five rows

Question

I have a data frame with a USERID column (the variable we group by), plus other variables: statuses and a date.

Some USERIDs have more than 5 statuses, so we should keep only the 5 most recent ones, by date.

How should I code this? It looks simple, but I haven't managed to do it.

Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. — zx8754, Mar 09 '16 at 11:02
IMHO Please give new users asking their first questions the chance to improve their questions without downvoting immediately otherwise it is somehow frustrating and may drive away new users. Just my personal opinion... — R Yoda, Mar 09 '16 at 11:08

akrun · Accepted Answer · 2016-03-09T11:31:42.293

2

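We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), order the rows by 'date' in decreasing order (assuming the 'date' column is of class Date), and take the first 5 rows of each 'USERID' group with head. Groups with fewer than 5 rows are returned whole, because head returns at most the requested number of rows.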

library(data.table)
setDT(df1)[order(-date), head(.SD, 5), by=USERID]

Or, as @Symbolix mentioned in the comments, we could use .I to get the row index and then remove the NA rows produced for groups that have fewer than 5 rows:

na.omit(setDT(df1)[df1[order(-date), .I[1:5], by = USERID]$V1])

data

set.seed(49)
df1 <- data.frame(USERID= sample(LETTERS[1:3], 12, 
  replace=TRUE), date= sample(seq(as.Date('2014-01-01'), 
  length.out=12, by = '1 day')))

edited Mar 09 '16 at 11:31

answered Mar 09 '16 at 10:17


1

also `df1[df1[order(-date), .I[1:5], by=userid]$V1 ][!is.na(userid)]` – SymbolixAU Mar 09 '16 at 10:26
I think there is no need for the `if` statement (`head` delivers as many rows as available but not more than the limit) – R Yoda Mar 09 '16 at 11:04

score 2 · Answer 2 · answered Mar 09 '16 at 10:38

If you're a fan of dplyr you can do

library(dplyr)

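df %>%
  group_by(USERID) %>%
  arrange(desc(date)) %>%   # desc() sorts newest first; unary minus is not defined for Date objects
  slice(1:5) %>%            # keeps up to 5 rows per group
  ungroup()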

On 'large' data sets the data.table approach will likely be faster, but dplyr has a slightly easier syntax to get your head around at first (in my opinion).
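For comparison, here is a minimal base R sketch of the same idea that needs no extra packages. It is not taken from either answer: the names sorted, rank_in_group and top5 are made up for illustration, and it assumes the 'date' column is of class Date, as in the example data above.

```r
# Example data, same construction as in the accepted answer
set.seed(49)
df1 <- data.frame(USERID = sample(LETTERS[1:3], 12, replace = TRUE),
                  date = sample(seq(as.Date("2014-01-01"),
                                    length.out = 12, by = "1 day")))

# Sort by USERID, then by date with the newest first
# (as.numeric() is used because unary minus is not defined for Date objects)
sorted <- df1[order(df1$USERID, -as.numeric(df1$date)), ]

# Number the rows 1, 2, 3, ... within each USERID
rank_in_group <- ave(seq_len(nrow(sorted)), sorted$USERID, FUN = seq_along)

# Keep at most the 5 most recent rows per USERID
top5 <- sorted[rank_in_group <= 5, ]
top5
```

Groups with 5 or fewer rows are kept whole, since every row in them gets a rank of at most 5.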
