1

How can i do Levenshtein distance measurement on word (not character) Level in R?

See the following:

Expected result 1)

# levenshtein operations needed: Delete*2 --> 2 operations
array1 <- c("word", "car")
array2 <- c("word", "pool", "beer", "car")

I am seeking a function levenshtein(), so that the distance of 2 is returned for the example above:

levenshtein(array1, array2)
--> 2

Expected result 2)

# levenshtein operations needed: Delete and insert --> 2 operations
array1 <- c("word", "car", "pool")
array2 <- c("word", "pool", "car")

I am seeking a function levenshtein(), so that the distance of 2 is returned for the example above:

levenshtein(array1, array2)
--> 2

I found the following: Word-level edit distance of a sentence But i didnt find a working needleman-wunsch implentation that yields the expected results, described above.

Tlatwork
  • 1,445
  • 12
  • 35

2 Answers2

0

Not entirely sure what you're asking... but is this what you are after?

lapply(array1, function(i){
    m <- drop(attr(adist(i, array2, counts = TRUE), "counts")) 
    row.names(m) <- array2
    setNames(list(m %>% as.data.frame()), i)
}) %>% unlist(recursive = FALSE)
$word
     ins del sub
word   0   0   0
pool   0   0   3
beer   1   1   2
car    0   1   2

$car
     ins del sub
word   1   0   2
pool   1   0   3
beer   1   0   2
car    0   0   0
Carl Boneri
  • 2,632
  • 1
  • 13
  • 15
  • hi Carl, thanks for your Response. I am not sure your answer would return the distance of 2 for both examples? I made an edit to the example in case it was unclear. – Tlatwork Oct 17 '17 at 16:39
  • I'm not understanding what you are asking I think. Are you looking to find all occurances in array2 that have a combined edit distance of 2, such that `insertions + deletions = 2`? What's the actual goal here; in lay-terms @ThanksGuys ? – Carl Boneri Oct 17 '17 at 16:43
0

We can map the unique words to letters, and use adist as the engine for the generalized edit distance.

levenshtein <- function(x, y){
  unique_words <- unique(c(x,y))
  letter_x <- plyr::mapvalues(x,
                              from = unique_words,
                              to = letters[1:length(unique_words)])
  letter_y <- plyr::mapvalues(y,
                              from = unique_words,
                              to = letters[1:length(unique_words)])
  adist(paste0(letter_x,collapse=''),paste0(letter_y,collapse=''))
}

array1 <- c("word", "car")
array2 <- c("word", "pool", "beer", "car")

levenshtein(array1, array2)


array1 <- c("word", "car", "pool")
array2 <- c("word", "pool", "car")

levenshtein(array1, array2)

Apparently, this function can only work for two character vectors with less than or equal to 26 unique words, you can generalize it to 52 (adding upper case Letters), or 62 (digits), etc. ...

The better approach apparently would be rewriting the adist function...

platypus
  • 516
  • 3
  • 8