I need to compute the (scaled) Hamming string distance

d(x, y) = #{i : x_i != y_i, i = 1, ..., n} / n,

where x and y are strings of length n. I use R and dplyr/tidyverse and defined the Hamming distance as
library(stringr)  # for str_split()
hamdist = function(x, y) mean(str_split(x, "")[[1]] != str_split(y, "")[[1]])
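For instance, on the classic example pair "karolin" and "kathrin", which differ in 3 of their 7 positions:

hamdist("karolin", "kathrin")    # 3/7 -> 0.4285714
hamdist("abcdefgh", "abcdefgh")  # identical strings -> 0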
This works perfectly fine for a single pair of strings. However, since I want to apply it column-wise, I have to use the rowwise verb (or map2 from the purrr package). The problem: my data set contains ~50 million observations, so the calculation takes hours.
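For reference, the map2 variant I mean (map2_dbl from purrr) gives correct results but still makes one hamdist call per pair, so it is no faster:

library(purrr)
map2_dbl(c("abc", "abd"), c("abc", "xbd"), hamdist)  # 0.0000000 0.3333333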
My question is therefore: is there a smoother/more efficient way to implement the Hamming string distance for column operations? (dplyr solutions are preferred.)
An example:
library(dplyr)  # for tibble(), mutate(), rowwise(); stringr loaded above

n = 1000
l = 8

# n random strings of l lowercase letters each
rstr = function(n, l = 1) replicate(n, paste0(letters[floor(runif(l, 1, 27))], collapse = ""))
hamdist = function(x, y) mean(str_split(x, "")[[1]] != str_split(y, "")[[1]])

df = tibble(a = rstr(n, l), b = rstr(n, l))
df %>% mutate(dist = hamdist(a, b))               # wrong! str_split(...)[[1]] only looks at the first pair, and that single value is recycled down the column
df %>% rowwise() %>% mutate(dist = hamdist(a, b)) # correct, but slow for n = 50 million
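To make the intent concrete, the kind of answer I am hoping for would vectorise over the l character positions rather than the n rows. A rough sketch of what I mean (assuming every string has exactly l characters; hamdist_vec is just a placeholder name):

hamdist_vec = function(x, y, l) {
  out = numeric(length(x))
  for (k in seq_len(l)) {
    # substr() is vectorised over x and y, so each iteration compares
    # character k of every pair at once; the loop runs l times, not n times
    out = out + (substr(x, k, k) != substr(y, k, k))
  }
  out / l
}

df %>% mutate(dist = hamdist_vec(a, b, l))  # same result as the rowwise version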