
I need to calculate the Hamming distance between an input string and a large dataset of strings. (All strings in the dataset have the same length as the input string.)

For example, if

input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")

the Hamming distance between input and each string in dataset is 1, 1, 3, 0, so the minimum is 0. I have written a function to calculate the Hamming distance between two strings:

HD <- function(str1, str2) {
  str1 <- as.character(str1)
  str2 <- as.character(str2)

  length.str1 <- nchar(str1)
  length.str2 <- nchar(str2)

  # split each string into a vector of single characters
  string.temp1 <- character(length.str1)
  for (i in 1:length.str1) {
    string.temp1[i] <- substr(str1, start = i, stop = i)
  }
  string.temp2 <- character(length.str2)
  for (i in 1:length.str2) {
    string.temp2[i] <- substr(str2, start = i, stop = i)
  }

  # count the positions where the two strings differ
  return(sum(string.temp1 != string.temp2))
}
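
Applying it over the whole dataset then looks roughly like this (a sketch):

# Sketch: compute the distance to every string in the dataset and take the minimum.
distances <- sapply(dataset, HD, str2 = input)
min(distances)
#[1] 0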

But the dataset is too big, so I need to speed this up. Do you have any idea how I can do it quickly? Thank you for your help.


2 Answers


You cannot do better than O(n): you have to go through the whole dataset and calculate the distance for each observation.

The only improvement you can make is to the dataset itself: if you keep the observations sorted, it becomes easier to find an exact match in the dataset (a distance-0 result) quickly. That is the only improvement available.
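
In R, one way to exploit this idea is to check for an exact match first and only fall back to the full scan when there is none. A minimal sketch, reusing the HD() function from the question (min_hamming is just an illustrative name):

# Sketch: short-circuit on an exact match (Hamming distance 0) before
# doing the full O(n) scan with HD().
min_hamming <- function(input, dataset) {
  if (input %in% dataset) {
    return(0L)   # exact match found, no need to scan
  }
  min(sapply(dataset, HD, str2 = input))
}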

Sal-laS
  • Thank you for your answer. And what about my implementation in R? Is there any possibility to make the code run faster? Or is there any algorithm that can speed it up by converting the letters into a binary string? – Codezy Aug 19 '18 at 14:08
  • @Codezy The code by itself is not the problem. The main time-consuming part is running it over the whole dataset, so the evaluation of a single distance is not a big deal here. – Sal-laS Aug 19 '18 at 14:13
  • I agree that, if the dataset is huge, it is not practical to sort the observations by the `strings` after the fact. But if it had been designed that way from the beginning, it would be doable. – Sal-laS Aug 19 '18 at 15:13

At the R level you can use strsplit, cbind, !=, colSums and min. They are all "vectorized".

a <- "YNYYEY"
b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
A <- strsplit(a, split = "")[[1]]
#[1] "Y" "N" "Y" "Y" "E" "Y"
B <- do.call("cbind", strsplit(b, split = ""))
#     [,1] [,2] [,3] [,4]
#[1,] "Y"  "Y"  "Y"  "Y" 
#[2,] "N"  "N"  "N"  "N" 
#[3,] "Y"  "Y"  "E"  "Y" 
#[4,] "Y"  "Y"  "N"  "Y" 
#[5,] "E"  "Y"  "E"  "E" 
#[6,] "E"  "Y"  "N"  "Y" 
D <- colSums(A != B)
#[1] 1 1 3 0
min(D)
#[1] 0

This kind of "vectorization" creates many temporary matrices / vectors and uses plenty of RAM. But hopefully it is worthwhile.
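
If memory does become a problem, one possible workaround (a sketch, not part of the approach above; chunk_size is an arbitrary illustrative value) is to process the dataset in chunks so the temporary matrix never gets too large:

# Sketch: process the dataset in chunks so the temporary character matrices
# created by strsplit() / cbind() stay bounded in size.
min_hamming_chunked <- function(input, dataset, chunk_size = 10000L) {
  A <- strsplit(input, split = "")[[1]]
  best <- Inf
  for (start in seq(1L, length(dataset), by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, length(dataset))
    B <- do.call("cbind", strsplit(dataset[idx], split = ""))
    best <- min(best, colSums(A != B))
  }
  best
}

min_hamming_chunked(a, b)
#[1] 0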

At the C / C++ level you can do much, much better (see a case study here), but I am not keen on writing C / C++ code today.


I came across the stringdist package (there is even a stringdist tag). The function stringdist relies on a workhorse routine stringdist:::do_dist, which is written in C. That spares me the effort.

library(stringdist)
d <- stringdist(a, b, method = "hamming")
#[1] 1 1 3 0
min(d)
#[1] 0

stringdist() runs almost ten times slower than the colSums() approach.

That is really interesting. Probably its C code or R code is doing something else complicated.
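
For what it's worth, a comparison along those lines could be set up with the microbenchmark package (a sketch; big_b and colsums_hd are illustrative names, and exact timings will depend on the data and machine):

library(microbenchmark)
library(stringdist)

# The colSums-based approach from above, wrapped in a function.
colsums_hd <- function(a, b) {
  A <- strsplit(a, split = "")[[1]]
  B <- do.call("cbind", strsplit(b, split = ""))
  colSums(A != B)
}

# Hypothetical larger test set: the example strings recycled many times.
set.seed(1)
big_b <- sample(b, 1e5, replace = TRUE)

microbenchmark(
  colsums    = colsums_hd(a, big_b),
  stringdist = stringdist(a, big_b, method = "hamming"),
  times = 20
)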

Zheyuan Li