1

I have a large data set (several million rows) stored as a data.table (I am currently learning to use the package). It is in a format like this (just the relevant columns):

DT=data.table("a"=c(11,11,11,12,13,13),"b"=c("x", "y", "z", "y", "x", "y"))
DT
   a  b
1: 11 x
2: 11 y
3: 11 z
4: 12 y
5: 13 x
6: 13 y

I am trying to build a function that based on input of value in "a", finds all the values associated with it in "b" (list A), and then searches through the remaining values in "a" and returns the one who has the most values in "b" matching to ones in list A.

e.g. for a=11, list A is "x, y, z", for a=13, list A is "x, y" the desired result for a=13 is a=11 (matches on 2 values - "x, y")

I am trying to find the most efficient way to do this. what I tried so far is:

toMatch=3
setkey(DT,a)
vals=DT[list(toMatch)]$b

vals
[1] "x" "y"

setkey(DT,b)
temp=data.table("listA"=DT[list(vals)]$a)
temp[,.N, by = names(temp)][order(-N)]

temp
   listA  N
1:     11 2
2:     31 2
3:     21 1

best=temp[listA!=toMatch][1]
   listA
1:    11

This gives the desired result, but requires resetting the key (which is not feasible each time) or constructing auxiliary tables, each with its own key. As previously mentioned, I am only now learning to use data.table - I am sure there is a more efficient way to achieve this.

Thanks for your help!

yarbaur
  • 75
  • 8
  • So ultimately, you simply trying to get the `a` value that has the most unique values in `b` ? – mtoto Apr 08 '16 at 10:55
  • not sure I understand your phrasing. I'm trying to find the value in `a` whose list of `b` values matches most closely the list of another value in `a`. – yarbaur Apr 08 '16 at 13:13
  • couldn't you just do `table(DT)` and go from there? – mtoto Apr 08 '16 at 13:40
  • You do not need to set the key for merges, thanks to a new feature in version 1.9.6: `X[Y, on=cols]`. I don't know the best reference for the new syntax, but this should help: http://stackoverflow.com/a/20057411/1191259 – Frank Apr 08 '16 at 15:35
  • @mtoto not very clear where/how you go from there...? – eddi Apr 08 '16 at 16:37

1 Answers1

0

This is just your algo, rewritten:

DT[, DT[.SD, on = 'b'][a != currval, .N, by = a][order(-N), .(bestmatch = a[1])]
   , by = .(currval = a)]
#   currval bestmatch
#1:      11        13
#2:      12        11
#3:      13        11
eddi
  • 49,088
  • 6
  • 104
  • 155