
There is a package named stringdist in R that contains functions for computing the Levenshtein string distance. I have two problems with this package:

First, it does not work with large strings, e.g.:

set.seed(1)
a.str <- paste(sample(0:9, 100000, replace = T), collapse="")

set.seed(2)
b.str <- paste(sample(0:9, 100000, replace = T), collapse="")

stringdist(a.str, b.str, method = "lv")
# THE LAST COMMAND RESTARTS R SESSION

Second, distances over vectors are computed per character of each vector element rather than over the vector as a whole:

a.vec <- c(1, 2, 3, 4, 5, 666)
b.vec <- c(1, 2, 4, 3, 6, 777)
stringdist(a.vec, b.vec, method = "lv")
# [1] 0 0 1 1 1 3

I would like the result of the last command to be 4, because 4 substitutions are needed (4 vector elements at corresponding positions differ). In this case I can fetch the non-zero values and count them, e.g.: r <- stringdist(a.vec, b.vec, method = "lv"); length(r[r != 0]). But that does not work in the following example:

a.vec <- c(1, 2, 3)
b.vec <- c(1, 2, 2, 3)
stringdist(a.vec, b.vec, method = "lv")
# [1] 0 0 1 1
# Warning message:
# In stringdist(a.vec, b.vec, method = "lv") :
#   longer object length is not a multiple of shorter object length

I would like the result of the last command to be 1 (insert an extra 2 into the 1st vector).

PS There is also a built-in implementation, but it does not work with large strings either (and, to be honest, I have no idea how it handles vectors because I do not understand its output):

adist(a.str,b.str, counts = T)
# Error in adist(a.str, b.str, counts = T) : 
#   'Calloc' could not allocate memory (1410265409 of 8 bytes)

Is there any implementation (preferably in Python, Perl, or R) that fulfills my requirements? Thank you very much.

PPS I have multiple files where each line contains numbers from 1 to ~500 (this is why I need to treat e.g. 347 as one element and not as a string composed of 3, 4, 7, because 3, 4 and 7 are other, separate numbers). Those files have ~250,000 lines, and I want to know how similar the files are to each other. I guess the 100k*100k matrix is the problem, but the Levenshtein algorithm mentioned here uses only 2*100k memory (if both strings are 100k long). I guess the trick is that it only computes the result and forgets HOW the result was computed, but that is OK for me. Hamming distance is not sufficient for me because I need to take insertions, deletions and substitutions into account; under Hamming distance the two strings 1234567890 and 0123456789 are completely different, but under Levenshtein they are similar.
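For illustration, the two-row version I have in mind could look roughly like this in Python (a hand-rolled sketch, not from any library; the name levenshtein is just my own). Because it operates on arbitrary sequences, a list element like 347 stays a single symbol:

```python
def levenshtein(a, b):
    """Levenshtein distance with the two-row trick: O(min(len(a), len(b)))
    memory instead of a full len(a) x len(b) matrix.  Works on any
    sequences (lists of numbers, strings, ...), comparing whole elements.
    """
    if len(a) < len(b):              # keep the shorter sequence as the row
        a, b = b, a
    prev = list(range(len(b) + 1))   # row for the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]                   # distance from a[:i] to the empty b
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein([1, 2, 3, 4, 5, 666], [1, 2, 4, 3, 6, 777]))  # 4
print(levenshtein([1, 2, 3], [1, 2, 2, 3]))                     # 1
print(levenshtein("1234567890", "0123456789"))                  # 2
```

This is exactly the behaviour I want: 4 substitutions in the first example, a single insertion in the second, and the Hamming counter-example comes out as distance 2.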

Wakan Tanka
  • 100000 * 100000 is 10GB. Not sure what your goal is. Why would you want to compute `stringdist` on such large strings? – Gopala Apr 26 '16 at 12:19
  • For example, `method=JW` on the same two strings in `stringdist` produces a result. The algorithm is different and does not need a quadratic amount of memory. – Gopala Apr 26 '16 at 12:23
  • As for your other problem of vectorized approach `stringdist` is taking, that is a very good thing for most use cases. If you really want it not to treat those two inputs as vectors, you can use `stringdist(paste(a.vec, collapse = ''), paste(b.vec, collapse = ''), method = "lv")` – Gopala Apr 26 '16 at 12:50
  • This may help: http://stackoverflow.com/questions/4057513/levenshtein-distance-algorithm-better-than-onm – Gopala Apr 26 '16 at 12:51
  • @Gopala I can e.g. zero-pad all elements and get e.g. `007` and `455`, but the problem is that this makes the already large strings even longer, so I would like to avoid it and use vectors instead. – Wakan Tanka Apr 26 '16 at 12:56
  • @Gopala the mentioned link looks similar to what I've posted, but I need a working implementation and I do not have the time to develop/debug/test it myself, so I would rather use something mature. – Wakan Tanka Apr 26 '16 at 12:58
  • I found the solution for you! :) Try `levenshteinDist` from the `RecordLinkage` package in R. It is using the memory efficient version. You still need to use `paste` to convert your vectors to strings to feed it. – Gopala Apr 26 '16 at 13:02
  • About your 2nd problem: a possible workaround is to map your vectors to strings of Unicode characters: http://stackoverflow.com/questions/43232409/levenstein-edit-distance-for-arbitrary-sequences – Scarabee Apr 05 '17 at 16:29
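The mapping workaround from the last comment can be sketched in Python (the helper name to_symbols and the 0x100 offset are arbitrary choices of mine, not from any package): each number becomes one Unicode character, so any off-the-shelf string edit-distance routine then counts a whole number as a single symbol:

```python
def to_symbols(vec, offset=0x100):
    """Map each number to one Unicode character.  The offset is an
    arbitrary base chosen so that small numbers avoid control characters;
    numbers from 1 to ~500 fit easily within the Unicode range.
    """
    return ''.join(chr(offset + n) for n in vec)

print(len(to_symbols([3, 4, 7])))  # 3 characters, three separate symbols
print(len(to_symbols([347])))      # 1 character, a single symbol
```

The resulting strings can then be fed to any string-based Levenshtein implementation without 347 decomposing into 3, 4, 7.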

1 Answer


Here is a solution to the memory problem:

library(RecordLinkage)

set.seed(1)
a.str <- paste(sample(0:9, 100000, replace = T), collapse="")
set.seed(2)
b.str <- paste(sample(0:9, 100000, replace = T), collapse="")
levenshteinDist(a.str, b.str)
[1] 73969

You still need to convert the vectors to strings with paste, as the package does not do that automatically (most use cases want the vectorized behaviour).

Here is how to get them treated as single strings instead:

a.vec <- c(1, 2, 3, 4, 5, 666)
b.vec <- c(1, 2, 4, 3, 6, 777)
levenshteinDist(paste(a.vec, collapse = ''), paste(b.vec, collapse = ''))
[1] 5
Gopala
  • thank you, I've upvoted, but I still need vectors. May I ask how you found the package? – Wakan Tanka Apr 26 '16 at 13:18
  • I used it in the past for record linking and had forgotten. So, reloaded and tried. Will edit to add the vector aspect. Hope that meets your needs. – Gopala Apr 26 '16 at 13:21
  • It is not working in all cases. Check my question here: http://cs.stackexchange.com/questions/56612/levenshtein-distance-cabable-working-with-large-vectors-not-strings – Wakan Tanka Apr 27 '16 at 20:22