I'm doing some product association work with two large data.tables: a rules table (2.4m rows) and a customer product table (3m rows). Effectively, I want to merge the two together and select the top 10 recommended products per customer, but doing that as one big merge isn't viable because of the size of the intermediate result (a sketch of that version is further down). To get round this, I want to iteratively merge the two tables at the customer level, select the top 10 products for that customer, and return the combined result.
The example below probably explains it better:
require(data.table)
customer <- data.table(customer = rep(1:5, 3),
                       product_bought = rep(c("A", "B", "C"), 5),
                       number = runif(15))[order(customer)]
rules <- data.table(product_bought = c("A", "B", "C"),
                    recommended_product = c("D", "E", "F"),
                    number2 = runif(3, min = 100, max = 200))
customer[, lapply(.SD, function(z) {
  a <- merge(z, rules, by = "product_bought")
  a[, new := number * number2]
  return(a[new == max(new)])
}), by = customer]
But I get the following error:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
What I want it to do, for every customer, is what the following does for customer 1 (with the single max row standing in for the top 10):
z <- customer[customer == 1]
a <- merge(z, rules, by = "product_bought")
a[, new := number * number2]
a[new == max(new)]
Which gives:
> a[new == max(new)]
   product_bought customer   number recommended_product  number2     new
1:              C        1 0.613043                   F 168.4335 103.257
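For reference, the one-shot "high level" version would be something like this (a sketch on the toy data; at the real 2.4m x 3m scale it's the intermediate merged table that blows up, and allow.cartesian = TRUE would likely be needed since products repeat on both sides):
# One big merge, then top 10 per customer. Fine on the toy data,
# not viable at full size because of the intermediate merged table.
full <- merge(customer, rules, by = "product_bought", allow.cartesian = TRUE)
full[, new := number * number2]
top10 <- full[order(-new), head(.SD, 10), by = customer]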
I did try a list-based approach, but a list of ~30k data.tables caused problems when I tried to rbindlist it back together.
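Roughly what that attempt looked like (a sketch, assuming split.data.table's by argument; at 30k customers the lapply produces 30k small tables):
# One small merge per customer, then stitch the pieces back together.
# The final rbindlist over ~30k tables is where it ran into trouble.
res <- lapply(split(customer, by = "customer"), function(z) {
  a <- merge(z, rules, by = "product_bought")
  a[, new := number * number2]
  head(a[order(-new)], 10)
})
out <- rbindlist(res)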
Any ideas why the merge within .SD doesn't work?
Cheers, Scott