
I need to compute the (scaled) Hamming string distance d(x, y) = #{i : x_i != y_i, i = 1, ..., n}/n, where x and y are strings of length n. I use R with dplyr/tidyverse and defined the Hamming distance as

hamdist = function(x,y) mean(str_split(x, "")[[1]] != str_split(y, "")[[1]])

This works perfectly fine. However, since I want to apply it column-wise, I have to use the rowwise() verb (or map2() from the purrr package). The problem: my data set contains ~50 million observations, so the calculation takes hours.

My question is therefore: is there a smoother/more efficient way to implement the Hamming string distance for column operations?

(dplyr solutions are preferable)

An example:

n = 1000
l = 8

rstr = function(n, l = 1) replicate(n, paste0(letters[floor(runif(l, 1, 27))], collapse = ""))

hamdist = function(x,y) mean(str_split(x, "")[[1]] != str_split(y, "")[[1]])

df = tibble(a = rstr(n, l), b = rstr(n, l))

df %>% mutate(dist = hamdist(a, b)) # wrong! compares only the first row's strings
df %>% rowwise() %>% mutate(dist = hamdist(a, b)) # correct, but slow for n = 50 million
Syd Amerikaner

1 Answer


See the stringdist package. Its stringdist() function takes a method argument that can be "hamming". The stringdist package claims to be:

Built for speed, using openMP for parallel computing.
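As a minimal sketch of how this applies to the example in the question (assuming equal-length strings; for the "hamming" method, stringdist() returns Inf when the two strings differ in length):

```r
library(dplyr)
library(stringdist)

n <- 1000
l <- 8

# same random-string generator as in the question
rstr <- function(n, l = 1) {
  replicate(n, paste0(letters[floor(runif(l, 1, 27))], collapse = ""))
}

df <- tibble(a = rstr(n, l), b = rstr(n, l))

# stringdist() is vectorized over its arguments, so no rowwise() is needed;
# it counts mismatching positions, and dividing by the string length l
# recovers the scaled distance d(x, y) from the question
df <- df %>% mutate(dist = stringdist(a, b, method = "hamming") / l)
```

Because the whole column is processed in one vectorized call, this avoids the per-row overhead of rowwise() entirely.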

Aurèle
    Thank you. This function in fact runs very fast: `system.time(df %>% mutate(dist = stringdist(a, b, method = "hamming")/8))` gives `user 0.002, system 0.000, elapsed 0.001`, whereas `system.time(df %>% rowwise() %>% mutate(dist = hamdist(a, b)))` gives `user 1.082, system 0.020, elapsed 1.102` (for n = 10000). – Syd Amerikaner Apr 26 '19 at 10:55