
I need to calculate the Hamming distance between an input string and a large dataset of strings. (All strings in the dataset have the same length as the input string.)

For example, if

input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")

the Hamming distance between input and each string in dataset is 1, 1, 3, 0, so the minimum is 0. I have written a function to calculate the Hamming distance between two strings:

HD <- function(str1, str2) {
  str1 <- as.character(str1)
  str2 <- as.character(str2)

  length.str1 <- nchar(str1)
  length.str2 <- nchar(str2)

  # split each string into a vector of single characters
  string.temp1 <- character(length.str1)
  for (i in 1:length.str1) {
    string.temp1[i] <- substr(str1, start = i, stop = i)
  }
  string.temp2 <- character(length.str2)
  for (i in 1:length.str2) {
    string.temp2[i] <- substr(str2, start = i, stop = i)
  }

  # count the positions where the two strings differ
  return(sum(string.temp1 != string.temp2))
}
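
Applying it over the whole dataset then looks roughly like this (a sketch):

# Sketch: compute the distance to every string in the dataset and take the minimum.
distances <- sapply(dataset, HD, str2 = input)
min(distances)
#[1] 0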

But the dataset is too big, so I need to speed this up. Do you have any idea how I can do it quickly? Thank you for your help.


2 Answers


You cannot do better than O(n): you have to go through the whole dataset and calculate the distance for each observation.

The only improvement you can make is to the dataset itself: if you keep the observations sorted, it becomes easier to find an exact match in the dataset (a distance-0 result) quickly. That is the only improvement available.
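
In R, one way to exploit this idea is to check for an exact match first and only fall back to the full scan when there is none. A minimal sketch, reusing the HD() function from the question (min_hamming is just an illustrative name):

# Sketch: short-circuit on an exact match (Hamming distance 0) before
# doing the full O(n) scan with HD().
min_hamming <- function(input, dataset) {
  if (input %in% dataset) {
    return(0L)   # exact match found, no need to scan
  }
  min(sapply(dataset, HD, str2 = input))
}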

Sal-laS
  • Thank you for your answer. And what about my implementation in R? Is there any possibility to make the code run faster? Or is there any algorithm that can speed it up by converting the letters into a binary string? – Codezy Aug 19 '18 at 14:08
  • @Codezy The code by itself is not the problem. The main time-consuming part is running it over the whole dataset, so the evaluation of a single distance is not a big deal here. – Sal-laS Aug 19 '18 at 14:13
  • I agree that, if the dataset is huge, it is not practical to sort the observations by the `strings` after the fact. But if it had been designed that way from the beginning, it would be doable. – Sal-laS Aug 19 '18 at 15:13

At the R level you can use strsplit, cbind, !=, colSums and min. They are all "vectorized".

a <- "YNYYEY"
b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
A <- strsplit(a, split = "")[[1]]
#[1] "Y" "N" "Y" "Y" "E" "Y"
B <- do.call("cbind", strsplit(b, split = ""))
#     [,1] [,2] [,3] [,4]
#[1,] "Y"  "Y"  "Y"  "Y" 
#[2,] "N"  "N"  "N"  "N" 
#[3,] "Y"  "Y"  "E"  "Y" 
#[4,] "Y"  "Y"  "N"  "Y" 
#[5,] "E"  "Y"  "E"  "E" 
#[6,] "E"  "Y"  "N"  "Y" 
D <- colSums(A != B)
#[1] 1 1 3 0
min(D)
#[1] 0

This kind of "vectorization" creates many temporary matrices / vectors and uses plenty of RAM. But hopefully it is worthwhile.
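
If memory does become a problem, one possible workaround (a sketch, not part of the approach above; chunk_size is an arbitrary illustrative value) is to process the dataset in chunks so the temporary matrix never gets too large:

# Sketch: process the dataset in chunks so the temporary character matrices
# created by strsplit() / cbind() stay bounded in size.
min_hamming_chunked <- function(input, dataset, chunk_size = 10000L) {
  A <- strsplit(input, split = "")[[1]]
  best <- Inf
  for (start in seq(1L, length(dataset), by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, length(dataset))
    B <- do.call("cbind", strsplit(dataset[idx], split = ""))
    best <- min(best, colSums(A != B))
  }
  best
}

min_hamming_chunked(a, b)
#[1] 0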

At the C / C++ level you can do much, much better (see a case study here), but I am not keen on writing C / C++ code today.


I came across the stringdist package (there is even a stringdist tag). The function stringdist relies on a workhorse routine stringdist:::do_dist, which is written in C. That spares me the effort.

library(stringdist)
d <- stringdist(a, b, method = "hamming")
#[1] 1 1 3 0
min(d)
#[1] 0

stringdist() runs almost ten times slower than the colSums() approach.

That is really interesting. Probably its C code or R code is doing something else complicated.
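
For what it's worth, a comparison along those lines could be set up with the microbenchmark package (a sketch; big_b and colsums_hd are illustrative names, and exact timings will depend on the data and machine):

library(microbenchmark)
library(stringdist)

# The colSums-based approach from above, wrapped in a function.
colsums_hd <- function(a, b) {
  A <- strsplit(a, split = "")[[1]]
  B <- do.call("cbind", strsplit(b, split = ""))
  colSums(A != B)
}

# Hypothetical larger test set: the example strings recycled many times.
set.seed(1)
big_b <- sample(b, 1e5, replace = TRUE)

microbenchmark(
  colsums    = colsums_hd(a, big_b),
  stringdist = stringdist(a, big_b, method = "hamming"),
  times = 20
)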

Zheyuan Li