I have a large data set (several million rows) stored as a data.table (I am currently learning to use the package). It is in a format like this (just the relevant columns):
DT=data.table("a"=c(11,11,11,12,13,13),"b"=c("x", "y", "z", "y", "x", "y"))
DT
a b
1: 11 x
2: 11 y
3: 11 z
4: 12 y
5: 13 x
6: 13 y
I am trying to build a function that based on input of value in "a", finds all the values associated with it in "b" (list A), and then searches through the remaining values in "a" and returns the one who has the most values in "b" matching to ones in list A.
e.g. for a=11, list A is "x, y, z", for a=13, list A is "x, y" the desired result for a=13 is a=11 (matches on 2 values - "x, y")
I am trying to find the most efficient way to do this. what I tried so far is:
toMatch=3
setkey(DT,a)
vals=DT[list(toMatch)]$b
vals
[1] "x" "y"
setkey(DT,b)
temp=data.table("listA"=DT[list(vals)]$a)
temp[,.N, by = names(temp)][order(-N)]
temp
listA N
1: 11 2
2: 31 2
3: 21 1
best=temp[listA!=toMatch][1]
listA
1: 11
This gives the desired result, but requires resetting the key (which is not feasible each time) or constructing auxiliary tables, each with its own key. As previously mentioned, I am only now learning to use data.table - I am sure there is a more efficient way to achieve this.
Thanks for your help!