There is a package named stringdist in R which contains functions for computing the Levenshtein string distance. I have two problems with this package:
First, it does not work with large strings, e.g.:
library(stringdist)
set.seed(1)
a.str <- paste(sample(0:9, 100000, replace = T), collapse="")
set.seed(2)
b.str <- paste(sample(0:9, 100000, replace = T), collapse="")
stringdist(a.str, b.str, method = "lv")
# THE LAST COMMAND RESTARTS R SESSION
Second, distances for vectors are computed element-wise (each pair of corresponding elements is compared as a string) rather than over the whole vector as one sequence:
a.vec <- c(1, 2, 3, 4, 5, 666)
b.vec <- c(1, 2, 4, 3, 6, 777)
stringdist(a.vec, b.vec, method = "lv")
# [1] 0 0 1 1 1 3
I would like the last command to return 4, because 4 substitutions are needed (4 vector elements at corresponding positions differ). In that case I could fetch the non-zero values and count them, e.g. r <- stringdist(a.vec, b.vec, method = "lv"); length(r[r != 0]). But that does not work in the following example:
a.vec <- c(1, 2, 3)
b.vec <- c(1, 2, 2, 3)
stringdist(a.vec, b.vec, method = "lv")
# [1] 0 0 1 1
# Warning message:
# In stringdist(a.vec, b.vec, method = "lv") :
# longer object length is not a multiple of shorter object length
I would like the last command to return 1 (one insertion: a 2 inserted into the first vector).
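In Python terms (one of the languages I would accept), what I want is roughly a textbook dynamic-programming Levenshtein distance that compares whole list elements instead of characters. This is only a sketch of the desired behaviour, not a solution to the memory problem:

```python
def token_levenshtein(a, b):
    """Levenshtein distance where each list element counts as one symbol."""
    # Full DP matrix: fine for short inputs, memory-hungry for long ones.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # i deletions turn a[:i] into an empty sequence
    for j in range(n + 1):
        d[0][j] = j          # j insertions turn an empty sequence into b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(token_levenshtein([1, 2, 3, 4, 5, 666], [1, 2, 4, 3, 6, 777]))  # 4
print(token_levenshtein([1, 2, 3], [1, 2, 2, 3]))                     # 1
```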
PS: There is also a built-in implementation, adist, but it does not work with large strings either (and, to be honest, I have no idea how it works with vectors, because I do not understand its output):
adist(a.str,b.str, counts = T)
# Error in adist(a.str, b.str, counts = T) :
# 'Calloc' could not allocate memory (1410265409 of 8 bytes)
Is there any implementation (preferably in Python, Perl, or R) which fulfills my requirements? Thank you very much.
PPS: I have multiple files where each line contains numbers from 1 to ~500 (this is why I need to treat e.g. 347 as one element and not as a string composed of 3, 4, 7, because 3, 4 and 7 are other, separate numbers). Those files have ~250,000 lines, and I want to know how similar those files are to each other. I guess the 10k*10k matrix size is the problem. But here a Levenshtein algorithm is mentioned which uses only 2*10k memory (if both strings are 10k long). I guess the trick is that it only computes the result and forgets HOW the result was computed, but that is OK for me. Hamming distance is not sufficient for me, because I need to take insertions, deletions, and substitutions into account; in Hamming distance these two strings
1234567890
0123456789
are completely different, but in Levenshtein they are similar.
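The two-row version I mean would look roughly like this sketch in Python: it keeps only the previous and current DP rows, so it returns the distance but not the alignment, and it works over list elements, so a line can be split into whole numbers first (the sample lines below are made up for illustration):

```python
def levenshtein_two_rows(a, b):
    """Levenshtein distance over list elements using O(min(m, n)) memory."""
    # Keep the shorter sequence as the inner row so memory stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]                    # i deletions reach the empty b-prefix
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1]

# Splitting on whitespace keeps 347 as one token, not three characters:
line1 = "12 347 5".split()
line2 = "12 3 4 7 5".split()
print(levenshtein_two_rows(line1, line2))  # 3

# The Hamming example from above: distance 2, not "completely different":
print(levenshtein_two_rows(list("1234567890"), list("0123456789")))  # 2
```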