R fastest way to do lookup for vector of strings

Question

I have a vector of strings x = c("hello", "world") and another vector y = c("hello", "world", "how", "are", "you"). I want to see which elements of x are inside y. For small vector this could easily be done using x %in% y. However I am looking for a more efficient way to do this - normally we would sort y first in O(n log n) time, then foreach string inside x we can do lookup in O(log n) time. I am worried that %in% is doing a full pass over y for each x it is looking up.

Is there a way to take advantage of sort and binary search in R? Or is there a way to build a hashset from y for fast lookup times?

The `data.table` package implements binary search--see http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf — Richard Border, Jan 30 '15 at 00:16
Have you don't any benchmarking of your own? Match() (and `%in%`) will build a hash table of it's own for long enough vectors. Are you matching repeatedly against the same values, or is this a one time match that does them all? — MrFlick, Jan 30 '15 at 00:16
also see package `hash` http://cran.r-project.org/web/packages/hash/hash.pdf — Richard Border, Jan 30 '15 at 00:17
This might be helpful - [Find closest value in a vector with binary search](http://stackoverflow.com/questions/20133344/find-closest-value-in-a-vector-with-binary-search) — thelatemail, Jan 30 '15 at 00:24
The `%chin%` function in `data.table` is supposedly optimized for character vectors — shadowtalker, Jan 30 '15 at 05:48
Indeed, %chin% was the fastest way to search character vectors — JCWong, Feb 01 '15 at 22:07

R fastest way to do lookup for vector of strings

0 Answers0