7

I have a vector of strings x = c("hello", "world") and another vector y = c("hello", "world", "how", "are", "you"). I want to see which elements of x are inside y. For small vector this could easily be done using x %in% y. However I am looking for a more efficient way to do this - normally we would sort y first in O(n log n) time, then foreach string inside x we can do lookup in O(log n) time. I am worried that %in% is doing a full pass over y for each x it is looking up.

Is there a way to take advantage of sort and binary search in R? Or is there a way to build a hashset from y for fast lookup times?

JCWong
  • 1,163
  • 1
  • 10
  • 29
  • The `data.table` package implements binary search--see http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf – Richard Border Jan 30 '15 at 00:16
  • 2
    Have you don't any benchmarking of your own? Match() (and `%in%`) will build a hash table of it's own for long enough vectors. Are you matching repeatedly against the same values, or is this a one time match that does them all? – MrFlick Jan 30 '15 at 00:16
  • also see package `hash` http://cran.r-project.org/web/packages/hash/hash.pdf – Richard Border Jan 30 '15 at 00:17
  • This might be helpful - [Find closest value in a vector with binary search](http://stackoverflow.com/questions/20133344/find-closest-value-in-a-vector-with-binary-search) – thelatemail Jan 30 '15 at 00:24
  • 2
    The `%chin%` function in `data.table` is supposedly optimized for character vectors – shadowtalker Jan 30 '15 at 05:48
  • Indeed, %chin% was the fastest way to search character vectors – JCWong Feb 01 '15 at 22:07

0 Answers0