
I am trying to take a subset of a data frame based on the occurrence of a value. This is best explained with the example given below. This question is closely related to: Selecting top finite number of rows for each unique value of a column in a data fame in R. However, I want to vary the number of items selected by the head() command.

#Sample data
input <- matrix( c( 1000001, 1000001, 1000001, 1000001, 1000001, 1000001,
                    1000002, 1000002, 1000002, 1000003, 1000003, 1000003,
                    100001, 100002, 100003, 100004, 100005, 100006,
                    100002, 100003, 100007, 100002, 100003, 100008,
                    "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                    "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                    "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04" ),
                  ncol = 3 )
colnames(input) <- c( "Product" , "Something" , "Date" )
#Note: mixing numbers and strings in matrix() coerces everything to character,
#so Product and Something end up as factors after as.data.frame()
input <- as.data.frame(input)
input$Date <- as.Date(input[ , "Date"], "%Y-%m-%d")

#Sort based on date, I want to leave out the entries with the oldest dates.
input <- input[ with( input, order(Date)), ]

#Create number of items I want to select
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling( table_input$Freq*0.25  )
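
For the sample data this gives the following per-product counts and quotas (so 2 rows should come from 1000001 and 1 row each from 1000002 and 1000003):

#table_input
#     Var1 Freq twentyfive
#1 1000001    6          2
#2 1000002    3          1
#3 1000003    3          1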

#This next part is a very time-consuming method (I have 2 million rows and 90k different products)

first <- TRUE

for( i in table_input$Var1 ) {
  data_selected <- input[input$Product == i,]
  number <- table_input[table_input$Var1 == i ,]$twentyfive

  head <- head( data_selected, number)        

  if( first == FALSE) {
    output <- rbind(output, head)
  } else {
    output <- head
  }
  first <- FALSE
}
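
I suspect a large share of that time goes into growing output with rbind() on every iteration. Here is a minimal sketch of the same loop that collects the pieces in a list and binds once at the end (same logic and inputs as above):

pieces <- vector( "list" , nrow(table_input) )
for( j in seq_len( nrow(table_input) ) ) {
  i <- as.character( table_input$Var1[j] )
  data_selected <- input[ input$Product == i , ]
  pieces[[j]] <- head( data_selected , table_input$twentyfive[j] )
}
output <- do.call( rbind , pieces )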

Hoping that someone knows a better, more efficient way. I tried to use the split function from the answer here: Selecting top finite number of rows for each unique value of a column in a data fame in R to split on the products and then iterate over the chunks and select the head() of each. However, the split function always runs out of memory (cannot allocate ..).

input_split <- split(input, input$Product) #Works here, but not in my problem.
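
For reference, a minimal sketch of what I was attempting, assuming split() fits in memory (it does not for my real data). The chunks of split() come back in the same order as the rows of table_input, so the per-product quotas line up:

heads <- Map( head , split( input , input$Product ) , table_input$twentyfive )
output <- do.call( rbind , heads )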

So in the end my problem is that I want to select a different number of rows for each unique Product. So here 2 items from 1000001 and 1 item each from 1000002 and 1000003.

Freddy
  • The lack of answers is probably because you have an overly complicated script up there. Try to reduce the problem down to the bare essentials so it is easy, easy, easy to read and I think you will get an answer quicker. – Simon O'Hanlon Oct 17 '13 at 11:29
  • @SimonO101 Thanks, I will remove some of the script then. – Freddy Oct 17 '13 at 11:41

2 Answers


Two solutions spring to mind. plyr::ddply is designed for your needs but using a data.table will be waaaaaay faster.

You want to take a data.frame, split it into chunks by Product, remove the oldest 25% of rows from each date-sorted chunk, and recombine the result into a data.frame. This can be accomplished in one simple line...

require( plyr )
ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
#  Product Something       Date
#1 1000001    100005 2011-01-01
#2 1000001    100002 2011-01-02
#3 1000001    100006 2011-01-02
#4 1000001    100004 2011-01-04
#5 1000002    100007 2011-01-01
#6 1000002    100003 2011-01-04
#7 1000003    100002 2011-01-02
#8 1000003    100008 2011-01-04

data.table solution

For data.table you will need the latest development version from r-forge (because use of a negative subscript is not yet implemented in the CRAN version of data.table). Make sure you use the following install.packages call to get the latest version...

install.packages( "data.table" , repos="http://r-forge.r-project.org" )
require( data.table )
DT <- data.table( input )

#  Sort by Product then Date very quickly
setkeyv( DT , c( "Product" , "Date" ) )

#  Return the bottom 75% of rows (i.e. not the earliest)
DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] 
#   Product Something       Date
#1: 1000001    100005 2011-01-01
#2: 1000001    100002 2011-01-02
#3: 1000001    100006 2011-01-02
#4: 1000001    100004 2011-01-04
#5: 1000002    100007 2011-01-01
#6: 1000002    100003 2011-01-04
#7: 1000003    100002 2011-01-02
#8: 1000003    100008 2011-01-04
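
A quick note on the negative length: tail() with a negative n simply drops that many rows (or elements) from the top, e.g.

tail( 1:10 , -3 )
#[1]  4  5  6  7  8  9 10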

A better way to use data.table

You could more easily do this (so you don't require the development version of data.table)...

DT[ ,  .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ] 

And you can also use lapply in the j argument (I was worried about my use of .SD), and this runs in ~14 seconds on a data.table of 2e6 rows with 90,000 products (groups)...

set.seed(1)
Product <- sample( 1:9e5 , 2e6 , repl = TRUE )
dates <- sample( 1:20 , 2e6 , repl = TRUE )
Date <- as.Date( Sys.Date() + dates )
DT <- data.table( Product = Product , Date = Date )

system.time( { setkeyv( DT , c( "Product" , "Date" ) ); DT[ , lapply( .SD , `[` ,  -c( 1:ceiling( .25 * .N ) ) ) , by = Product ] } )
#   user  system elapsed 
#  14.65    0.03   14.74 

Update: The best way to use data.table!

So thanks to @Arun (who is now an author of the data.table package) we now have the best way to use data.table: use .I, which is an integer vector of all the row indices, subset it in [ by removing the first 25% of records with -(1:ceiling(.N*.25)), and then perform a subset using these row indices to get the final table. This is ~4-5 times faster than using my .SD method above. Amazing stuff!

system.time( DT[ DT[, .I[-(1:ceiling(.N*.25))] , by = Product]$V1] )
   user  system elapsed 
   3.02    0.00    3.03
Simon O'Hanlon
  • ddply( input , .(Product) , function(x){ head(x, table_input[table_input$Var1 == x$Product[1],]$twentyfive ) } ) This seems to do the same as my whole for loop. Thanks @Simon. Would you happen to know how I can compare the performance? – Freddy Oct 17 '13 at 12:08
  • Thank you for all your help! I use stackoverflow a lot to search for solutions, but you really took it to the next level. Computing time you saved me with your data.table solution: approx. 720 mins with my (improved, not the one above) for-loop down to 10 mins with data.table. – Freddy Oct 17 '13 at 14:13
  • @Freddy I get it running in 14 seconds on a 2e6 row data.table with 90,000 products to group by. – Simon O'Hanlon Oct 17 '13 at 14:43
  • @SimonO101, `nrow(.SD) = .N`. Also, you can replace `.SD` with `.I`. Using `.SD` would be time-consuming here. That is, `DT[DT[, .I[-(1:ceiling(.N*.25))], by=Product]$V1]` will be much faster. The key will be same as that of `DT` using this method. You can change that if you want. – Arun Oct 17 '13 at 17:15
  • @Arun thanks for this great suggestion. I hope I have more or less explained it ok in the answer? Thanks! – Simon O'Hanlon Oct 18 '13 at 09:03
  • @Arun Awesome, I should get to know data.table better, this works so well on large data sets. Subsetting 3 data sets (for 25,50 and 75) takes me 4 seconds now. Which makes it 10000 times faster then my initial 'solution'. – Freddy Oct 18 '13 at 13:08
  • @SimonO101, no problem. Made a minor change. Hope it is okay. – Arun Oct 18 '13 at 15:29
  • @Freddy, glad! get started with the introduction vignette and then probably the FAQ. Ask if you have doubts, here or on data.table mailing list. – Arun Oct 18 '13 at 15:32

Here is a way using mapply and your input and table_input:

    #your code
    #input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
    #colnames(input) <- c( "Product" , "Something" ,"Date")
    #input <- as.data.frame(input)
    #input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")

    #Sort based on date, I want to leave out the entries with the oldest dates.
    #input <- input[ with( input, order(Date)), ]

    #Create number of items I want to select
    #table_input <- as.data.frame(table(input$Product))
    #table_input$twentyfive <- ceiling( table_input$Freq*0.25  )

    #function to "mapply" on "table_input"
    fun = function(p, d) { grep(p, input$Product)[1:d] }

    #subset "input"
    input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]

       Product Something       Date
    1  1000001    100001 2011-01-01
    3  1000001    100003 2011-01-01
    7  1000002    100002 2011-01-01
    11 1000003    100003 2011-01-01
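
As a side note, grep() treats p as a regular expression, so a product ID that is a substring of another ID (say 100001 vs 1000011 - not the case in this sample data) could pick up wrong rows. A sketch of an exact-match variant of the same idea (fun_exact is just a hypothetical name):

    #hypothetical exact-match variant: which() avoids regex substring matches
    fun_exact = function(p, d) { which(input$Product == as.character(p))[1:d] }
    input[unlist(mapply(fun_exact, table_input$Var1, table_input$twentyfive)),]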

I also called system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:

    #SimonO101's code
    #require( plyr )
    #ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
    #install.packages( "data.table" , repos="http://r-forge.r-project.org" )
    #require( data.table )
    #DT <- data.table( input )
    #setkeyv( DT , c( "Product" , "Date" ) )
    #DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ]

    > system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
       user  system elapsed 
       5.29    0.00    5.29 
    > system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
      user  system elapsed 
      43.48    0.03   44.04 
    > system.time(replicate(10000,  DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] ))                        
      user  system elapsed 
      34.30    0.01   34.50 

BUT: SimonO101's alternatives do not produce the same result as mapply, because I used mapply with the table_input you posted; I don't know if this plays any role in the comparison. Also, the comparison may have been set up poorly by me. I just did it because of the speed issue you pointed out. I'd really want @SimonO101 to see this in case I'm talking nonsense.

alexis_laz
  • I don't think they are comparable - you get different (wrong according to OP) results. Also (and I don't know how this would affect timing) you should try it out on a low number of replications on a very large table, not many iterations on a small table (as that is what the OP problem stated - 2 million rows). – Simon O'Hanlon Oct 17 '13 at 14:27
  • I tried your solution on a 2e6 row table and stopped it running after about 20 minutes. Try the two solutions using the data at the bottom of my answer. – Simon O'Hanlon Oct 17 '13 at 15:01
  • I also just tried my solution on a 2-million-row dataframe and `R` was not responding for a fair amount of time. :) I guess `mapply` is out of the picture now. – alexis_laz Oct 17 '13 at 15:08