0

I have a dataframe with USERID(thus we group by this), and other variables statuses and a date.

Some of these USERID's have more than 5 statuses, so we should keep only the 5 most recent ones,by date.

How should I code this, looks simple but I haven't manage to do so.

zx8754
  • 52,746
  • 12
  • 114
  • 209
  • 2
    Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – zx8754 Mar 09 '16 at 11:02
  • 1
    IMHO Please give new users asking their first questions the chance to improve their questions without downvoting immediately otherwise it is somehow frustrating and may drive away new users. Just my personal opinion... – R Yoda Mar 09 '16 at 11:08

2 Answers2

2

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'USERID', we order the 'date' in decreasing (Assuming that the 'date' column is Date class) and get the first 5 rows with head

library(data.table)
setDT(df1)[order(-date), head(.SD, 5), by=USERID]

Or as @Symbolix mentioned in the comments, we could also make use of .I to get the row index and later remove the NA rows for groups that don't have 5 rows)

 na.omit(setDT(df1)[df1[order(-date), .I[1:5], by= USERID]$V1 ])

data

set.seed(49)
df1 <- data.frame(USERID= sample(LETTERS[1:3], 12, 
  replace=TRUE), date= sample(seq(as.Date('2014-01-01'), 
  length.out=12, by = '1 day')))
akrun
  • 874,273
  • 37
  • 540
  • 662
2

If you're a fan of dplyr you can do

library(dplyr)

df %>%
  group_by(USERID) %>%
  arrange(-date) %>%
  slice(1:5) %>%
  ungroup

On 'large' data sets the data.table approach will likely be faster, but dplyr has a slightly easier syntax to get your head around at first (in my opinion).

SymbolixAU
  • 25,502
  • 4
  • 67
  • 139