2

I have found several solutions on the web that can be used to accomplish this task. However, none of them seems to scale in R. What is the best way to do this?

Problem description:
Data: DT or DF is a table that has an id column and otherwise lots of random data The table is already sorted by id (due to construction)

Task: Extract the last n rows for each id in the table

library(data.table)
library(dplyr)
library(magrittr)

# config
set.seed(20160313)
num.rows = 10^5

# build data set
DT <- data.table( c1=runif(num.rows) )
for( i in 2:9 )
    DT[[paste0("c",i)]] <- runif(num.rows)

DT$id <- 1:num.rows %/% 4

# solution with data.table
setkey(DT,id)
cat( "solution: data.table\n" )
print(system.time(
    t1 <- DT[,tail(.SD,n=n),by=id]
))

# solution with dplyr
DF <- as.data.frame(DT)
cat( "solution: dply\n" )
print(system.time(
    t2 <- DF %>% group_by(id) %>% do( tail(.,n=n) )
))

# second solution with dplyr
cat( "solution: dplyr 2\n" )
print(system.time({
    t3 <- DF %>% group_by(id) %>% filter(rank(-row_number()) <= n)
}))


# solution with by command
cat( "solution: by\n" )
print(system.time( {
    temp <- by( DT, DT$id, tail, n=n )
    t4 <- do.call( "rbind", as.list( temp ) )
}))

# solution using split and lapply
cat( "solution: split and lapply\n" )
print(system.time( {
    temp <- split(DT,DT$id)
    temp <- lapply(temp, tail, n=n)
    t5 <- do.call( "rbind", temp )
}))

cat( "solution: via data.table 3\n" )
print(system.time( {
    t6 <- DT[DT[,tail(.I,n),by=id]$V1,]
}))


# failsafe checks
if( all(t1$c1 == t2$c1) )
    cat( "1==2 OK\n" )
if( all(t1$c1 == t3$c1) )
    cat( "1==3 OK\n" )
if( all(t1$c1 == t4$c1) )
    cat( "1==4 OK\n" )
if( all(t1$c1 == t5$c1) )
    cat( "1==5 OK\n" )
if( all(t1$c1 == t6$c1) )
    cat( "1==6 OK\n" )

edit
I did some testing of the answers below with 10^7 rows (caution: some of the solutions above will not work for this many rows)

Best performers:
For n=1, i.e., extracting the last row of each group

system.time( unique(DT,by="id",fromLast=T))
#   user  system elapsed 
#  0.376   0.036   0.411

system.time( DT[,.SD[.N],by=id])
#   user  system elapsed
# 10.636   0.020  10.652

For other n

system.time( DT[DT[,tail(.I,n),by=id]$V1,] )
# for n=2
#   user  system elapsed 
# 33.740   0.112  33.872

# for n=3
#   user  system elapsed 
# 33.988   0.184  34.194

This still seems a bit much but it works for my case.

smci
  • 32,567
  • 20
  • 113
  • 146
camo
  • 422
  • 4
  • 9
  • Did you try `unique(DT, by = "id", fromLast = TRUE)`? OR `DT[!duplicated(DT$id, fromLast = TRUE)]` ? – David Arenburg Mar 13 '16 at 16:38
  • Great! That works just fine for the specific question. Thanks. However, the problem remains if we wanted to use (let's say) the last 2 rows. – camo Mar 13 '16 at 16:43
  • a possible alternative `DT[, .SD[.N], by = id]` – Jaap Mar 13 '16 at 16:45
  • 1
    For the last (up to) `n` rows: `DT[DT[, tail(.I,n), by=id]$V1]` – Frank Mar 13 '16 at 16:56
  • @Frank ; it would be good to add this as an answer rather than hiding away in the comments as it answers the *last n* part , rather than last one (as in the duplicate Q) ediyt: oh just realised it has now been added to the question. ) – user2957945 Apr 05 '18 at 16:21
  • 2
    @user2957945 David answered the OP's original question by closing it. Then the OP expanded their question to what you see now (which is not-great etiquette https://meta.stackoverflow.com/questions/332820/what-to-do-when-someone-answers-dont-be-a-chameleon-dont-be-a-vandal ) Maybe I should revert their edit to only show the original question? I agree it's awkward for the question to contain answer content as it does now ... but seems tolerable. Anyway, I found some more related questions and added them to the dupe list. – Frank Apr 05 '18 at 16:30

0 Answers0