I am trying to take a subset of a data frame, based on the occurrence of a value. This is best explained by the example below. This question is closely related to: Selecting top finite number of rows for each unique value of a column in a data fame in R. However, I want to vary the number of rows selected by the head() command for each value.
#Sample data
input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
colnames(input) <- c( "Product" , "Something" ,"Date")
input <- as.data.frame(input)
input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort based on date, I want to leave out the entries with the oldest dates.
input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
#This next part is very time-consuming (I have 2 million rows and 90k different products)
first <- TRUE
for (i in table_input$Var1) {
  data_selected <- input[input$Product == i, ]
  number <- table_input[table_input$Var1 == i, ]$twentyfive
  head <- head(data_selected, number)
  if (first == FALSE) {
    output <- rbind(output, head)
  } else {
    output <- head
  }
  first <- FALSE
}
I am hoping that someone knows a better, more efficient way. I tried to use the split function from the answer here: Selecting top finite number of rows for each unique value of a column in a data fame in R, to split the data on the products and then iterate over the pieces and select the head() of each. However, on my real data the split function always runs out of memory (cannot allocate ..)
input_split <- split(input, input$Product) #Works here, but not on my real data.
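One idea that might sidestep the memory problem is to split only the row positions instead of the whole data frame, so that no per-product data frames are created. The following is a minimal sketch on a stand-in copy of the sample data. It has not been tried on the 2-million-row case, and `idx_by_product`, `keep`, and `output_split` are names introduced here purely for illustration.

```r
# Stand-in for the sample data above (same columns, same values).
input <- data.frame(
  Product   = rep(c("1000001", "1000002", "1000003"), times = c(6, 3, 3)),
  Something = c("100001", "100002", "100003", "100004", "100005", "100006",
                "100002", "100003", "100007", "100002", "100003", "100008"),
  Date      = as.Date(rep(c("2011-01-01", "2011-01-02",
                            "2011-01-01", "2011-01-04"), times = 3)),
  stringsAsFactors = FALSE
)
input <- input[order(input$Date), ]

# Split integer row positions by product; no per-product data frames are built.
idx_by_product <- split(seq_len(nrow(input)), input$Product)

# From each product keep the first ceiling(25%) positions,
# which are already in date order because input was sorted first.
keep <- unlist(
  lapply(idx_by_product, function(idx) head(idx, ceiling(length(idx) * 0.25))),
  use.names = FALSE
)

# A single subsetting step replaces the repeated rbind().
output_split <- input[sort(keep), ]
```

Whether this is fast enough for 90k products is an open question, but it at least avoids growing `output` with rbind() inside a loop.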
So in the end, my problem is that I want to select a different number of rows for each unique Product. Here that means 2 rows from 1000001 and 1 row each from 1000002 and 1000003.
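To make the intended selection concrete, here is a rough sketch of the same rule written without a loop, using base R's ave() to compute each row's position within its product and the size of that product's group. The helper `top_fraction` and the data frame `demo` are hypothetical names used only for this illustration, and the sketch assumes the data are already sorted in the order in which rows should be kept.

```r
# Hypothetical helper: keep the first `fraction` (rounded up) of the rows
# in each group, in the row order the data frame already has.
top_fraction <- function(df, group, fraction = 0.25) {
  pos <- seq_len(nrow(df))
  rank_in_group <- ave(pos, group, FUN = seq_along)  # 1, 2, 3, ... within each group
  group_size    <- ave(pos, group, FUN = length)     # group size repeated on every row
  df[rank_in_group <= ceiling(group_size * fraction), , drop = FALSE]
}

# Reduced stand-in for the sample data (Product and Date only).
demo <- data.frame(
  Product = rep(c("1000001", "1000002", "1000003"), times = c(6, 3, 3)),
  Date    = as.Date(rep(c("2011-01-01", "2011-01-02",
                          "2011-01-01", "2011-01-04"), times = 3)),
  stringsAsFactors = FALSE
)
demo <- demo[order(demo$Date), ]

output_ave <- top_fraction(demo, demo$Product)
```

On this stand-in data, `output_ave` holds 2 rows for 1000001 and 1 row each for 1000002 and 1000003, which is the selection described above.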