I have found several solutions on the web that can be used to accomplish this task. However, none of them seems to scale in R. What is the best way to do this?
Problem description:
Data: DT or DF is a table that has an id column and otherwise lots of random data
The table is already sorted by id (due to construction)
Task: Extract the last n rows for each id in the table
library(data.table)
library(dplyr)
library(magrittr)
# config
set.seed(20160313)
num.rows = 10^5
# build data set
DT <- data.table( c1=runif(num.rows) )
for( i in 2:9 )
DT[[paste0("c",i)]] <- runif(num.rows)
DT$id <- 1:num.rows %/% 4
# solution with data.table
setkey(DT,id)
cat( "solution: data.table\n" )
print(system.time(
t1 <- DT[,tail(.SD,n=n),by=id]
))
# solution with dplyr
DF <- as.data.frame(DT)
cat( "solution: dply\n" )
print(system.time(
t2 <- DF %>% group_by(id) %>% do( tail(.,n=n) )
))
# second solution with dplyr
cat( "solution: dplyr 2\n" )
print(system.time({
t3 <- DF %>% group_by(id) %>% filter(rank(-row_number()) <= n)
}))
# solution with by command
cat( "solution: by\n" )
print(system.time( {
temp <- by( DT, DT$id, tail, n=n )
t4 <- do.call( "rbind", as.list( temp ) )
}))
# solution using split and lapply
cat( "solution: split and lapply\n" )
print(system.time( {
temp <- split(DT,DT$id)
temp <- lapply(temp, tail, n=n)
t5 <- do.call( "rbind", temp )
}))
cat( "solution: via data.table 3\n" )
print(system.time( {
t6 <- DT[DT[,tail(.I,n),by=id]$V1,]
}))
# failsafe checks
if( all(t1$c1 == t2$c1) )
cat( "1==2 OK\n" )
if( all(t1$c1 == t3$c1) )
cat( "1==3 OK\n" )
if( all(t1$c1 == t4$c1) )
cat( "1==4 OK\n" )
if( all(t1$c1 == t5$c1) )
cat( "1==5 OK\n" )
if( all(t1$c1 == t6$c1) )
cat( "1==6 OK\n" )
edit
I did some testing of the answers below with 10^7 rows (caution: some of the solutions above will not work for this many rows)
Best performers:
For n=1, i.e., extracting the last row of each group
system.time( unique(DT,by="id",fromLast=T))
# user system elapsed
# 0.376 0.036 0.411
system.time( DT[,.SD[.N],by=id])
# user system elapsed
# 10.636 0.020 10.652
For other n
system.time( DT[DT[,tail(.I,n),by=id]$V1,] )
# for n=2
# user system elapsed
# 33.740 0.112 33.872
# for n=3
# user system elapsed
# 33.988 0.184 34.194
This still seems a bit much but it works for my case.