I have a vector of strings and I would like to hash each element individually to integers modulo n.

This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:

library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)

So this approach does not work: it cannot even produce an integer, let alone that integer modulo n.

What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.

xiaodai

2 Answers

R stores integer vectors as 32-bit signed integers, so the largest representable value is .Machine$integer.max = 2147483647 (about 2*10^9). strtoi returns NA because the 32-bit hash, read as an unsigned number, can exceed that limit, as it does here.

The mpfr function from the Rmpfr package should work for you:

library(digest)
library(Rmpfr)
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065
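
If an arbitrary-precision type is more than you need, a base-R sketch of the remaining step (hex string to a number modulo n) might look like the following. It assumes the hash is at most 13 hex digits (52 bits), so that a double holds it exactly; xxhash32 gives 8 digits. hex_mod_n is a hypothetical helper name, and the serialize = FALSE setting is my assumption that you want to hash the raw string rather than its serialized R object.

```r
# Convert hexadecimal hash strings to bucket indices in 0..(n - 1).
# hex_mod_n is a hypothetical helper name, not part of digest or Rmpfr.
hex_mod_n <- function(hex, n) {
  # as.numeric() parses "0x..." strings as hexadecimal and returns doubles,
  # which hold every 32-bit unsigned value exactly.
  as.numeric(paste0("0x", hex)) %% n
}

h <- "f9ec1699"    # hex form of 4192999065, the value shown above
strtoi(h, 16L)     # NA: exceeds .Machine$integer.max
hex_mod_n(h, 17)   # same as 4192999065 %% 17

# Applied to a character vector. digest() hashes one object per call,
# so loop with vapply(). Runs only if the digest package is installed.
if (requireNamespace("digest", quietly = TRUE)) {
  x <- c("aa", "bb", "cc")
  hashes <- vapply(x, digest::digest, character(1),
                   algo = "xxhash32", serialize = FALSE)
  print(hex_mod_n(hashes, 17))
}
```

The conversion itself is vectorized; the per-string digest() call is the slow part for a large vector.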
Birger

I wrote an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.

To use it:

if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)
xiaodai