using data.table to subset and aggregate efficiently

Question

I have a large data set (several million rows) stored as a data.table (I am currently learning to use the package). It is in a format like this (just the relevant columns):

DT=data.table("a"=c(11,11,11,12,13,13),"b"=c("x", "y", "z", "y", "x", "y"))
DT
   a  b
1: 11 x
2: 11 y
3: 11 z
4: 12 y
5: 13 x
6: 13 y

I am trying to build a function that based on input of value in "a", finds all the values associated with it in "b" (list A), and then searches through the remaining values in "a" and returns the one who has the most values in "b" matching to ones in list A.

e.g. for a=11, list A is "x, y, z", for a=13, list A is "x, y" the desired result for a=13 is a=11 (matches on 2 values - "x, y")

I am trying to find the most efficient way to do this. what I tried so far is:

toMatch=3
setkey(DT,a)
vals=DT[list(toMatch)]$b

vals
[1] "x" "y"

setkey(DT,b)
temp=data.table("listA"=DT[list(vals)]$a)
temp[,.N, by = names(temp)][order(-N)]

temp
   listA  N
1:     11 2
2:     31 2
3:     21 1

best=temp[listA!=toMatch][1]
   listA
1:    11

This gives the desired result, but requires resetting the key (which is not feasible each time) or constructing auxiliary tables, each with its own key. As previously mentioned, I am only now learning to use data.table - I am sure there is a more efficient way to achieve this.

Thanks for your help!

So ultimately, you simply trying to get the `a` value that has the most unique values in `b` ? — mtoto, Apr 08 '16 at 10:55
not sure I understand your phrasing. I'm trying to find the value in `a` whose list of `b` values matches most closely the list of another value in `a`. — yarbaur, Apr 08 '16 at 13:13
You do not need to set the key for merges, thanks to a new feature in version 1.9.6: `X[Y, on=cols]`. I don't know the best reference for the new syntax, but this should help: http://stackoverflow.com/a/20057411/1191259 — Frank, Apr 08 '16 at 15:35

score 0 · Answer 1 · answered Apr 08 '16 at 16:30

0

This is just your algo, rewritten:

DT[, DT[.SD, on = 'b'][a != currval, .N, by = a][order(-N), .(bestmatch = a[1])]
   , by = .(currval = a)]
#   currval bestmatch
#1:      11        13
#2:      12        11
#3:      13        11

answered Apr 08 '16 at 16:30

eddi

49,088
6
104
155

using data.table to subset and aggregate efficiently

1 Answers1