I'm doing some product association work with two large data.tables: a rules table (2.4m rows) and a customer product table (3m rows). Effectively, I want to merge the two together and select the top 10 recommended products per customer, but doing that as one big merge isn't viable because of the size of the intermediate result (a sketch of that version is further down). To get round this, I want to iteratively merge the two tables at the customer level, select the top 10 products for that customer, and return the combined result.
The example below probably explains it better:
require(data.table)
customer <- data.table(customer = rep(1:5, 3),
                       product_bought = rep(c("A", "B", "C"), 5),
                       number = runif(15))[order(customer)]
rules <- data.table(product_bought = c("A", "B", "C"),
                    recommended_product = c("D", "E", "F"),
                    number2 = runif(3, min = 100, max = 200))
customer[, lapply(.SD, function(z) {
  a <- merge(z, rules, by = "product_bought")
  a[, new := number * number2]
  return(a[new == max(new)])
}), by = customer]
But I get the following error:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
What I want it to do, for every customer, is what the following does for customer 1 (with the single max row standing in for the top 10):
z <- customer[customer == 1]
a <- merge(z, rules, by = "product_bought")
a[, new := number * number2]
a[new == max(new)]
Which gives:
> a[new == max(new)]
   product_bought customer   number recommended_product  number2     new
1:              C        1 0.613043                   F 168.4335 103.257
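For reference, the one-shot "high level" version would be something like this (a sketch on the toy data; at the real 2.4m x 3m scale it's the intermediate merged table that blows up, and allow.cartesian = TRUE would likely be needed since products repeat on both sides):
# One big merge, then top 10 per customer. Fine on the toy data,
# not viable at full size because of the intermediate merged table.
full <- merge(customer, rules, by = "product_bought", allow.cartesian = TRUE)
full[, new := number * number2]
top10 <- full[order(-new), head(.SD, 10), by = customer]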
I did try a list-based approach, but a list of ~30k data.tables caused problems when I tried to rbindlist it back together.
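Roughly what that attempt looked like (a sketch, assuming split.data.table's by argument; at 30k customers the lapply produces 30k small tables):
# One small merge per customer, then stitch the pieces back together.
# The final rbindlist over ~30k tables is where it ran into trouble.
res <- lapply(split(customer, by = "customer"), function(z) {
  a <- merge(z, rules, by = "product_bought")
  a[, new := number * number2]
  head(a[order(-new)], 10)
})
out <- rbindlist(res)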
Any ideas why the merge within .SD doesn't work?
Cheers, Scott