2

I'm doing some product association work where I have two large data.tables. One is a rules table (2.4m rows) and one is a customer product table (3m rows). Effectively what I want to do is merge the two together and select the top 10 products per customer, but doing this at the high level isn't viable due to the size. However, to get round this, I want to iteratively merge the two tables at a customer level, select the top 10 products and return it.

The below example probably explains it better:

require(data.table)

customer <- data.table(customer=rep(seq(1:5),3),product_bought=rep(c("A","B","C"),5), number=runif(15))[order(customer)]

rules <- data.table(product_bought=c("A","B","C"),recommended_product=c("D","E","F"),number2=runif(3,min=100,max=200))

customer[,lapply(.SD, function(z){
  a <- merge(z,rules,by="product_bought")
  a[,new:=number*number2]
  a[new==max(new)]
  return(a)
}),by=customer]

But I get the following error:

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid colum

What I want it to do for all customers is this:

z <- customer[customer==1]
a <- merge(z,rules,by="product_bought")
a[,new:=number*number2]
a[new==max(new)]

Which gives:

> a[new==max(new)]
   product_bought customer   number recommended_product  number2     new
1:  

        C        1           0.613043                F  168.4335     103.257

I did try using lists, but having a list of 30k data.tables had issues when trying to rbindlist it back up again.

Any ideas why the merge within a .SD doesn't work?

Cheers, Scott

1 Answers1

1

I guess you were trying to do this:

customer[, {
             a <- merge(.SD,rules,by="product_bought");
             a[, new:=number*number2];
             a[new==max(new)]
           }, by = customer]

But it's much better to do a single merge:

customer[rules, on = 'product_bought', new := number * number2]
customer[, .SD[new == max(new)], by = customer]

Or do the .I trick if the last line is too slow.

Community
  • 1
  • 1
eddi
  • 49,088
  • 6
  • 104
  • 155
  • The above works with copy around the .SD - thanks! If I do the single merge, I'll get a massive table (over 500m rows) which is what I was trying to avoid doing. – Scott Bradshaw Nov 16 '15 at 08:56