
I am doing clinical message normalization (spell check), in which I check each given word against a 900,000-word medical dictionary. I am mostly concerned about the time complexity/performance.

I want to do fuzzy string comparison, but I'm not sure which library to use.

Option 1:

import Levenshtein
Levenshtein.ratio('hello world', 'hello')

Result: 0.625

Option 2:

import difflib
difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()

Result: 0.625

In this example both give the same answer. Do you think both perform alike in this case?
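For a concrete sense of how either ratio would drive a spell-check lookup, here is a minimal sketch using only `difflib` from the standard library. The word list and the misspelled query are made-up stand-ins for dictionary entries, not real data from the question:

```python
import difflib

# toy stand-in for a (much larger) medical dictionary
dictionary = ["hemorrhage", "haemorrhage", "hemorrhoid", "diabetes"]
query = "hemorrage"  # misspelling to normalize

# pick the dictionary entry with the highest similarity ratio
best = max(dictionary,
           key=lambda w: difflib.SequenceMatcher(None, query, w).ratio())
print(best)  # hemorrhage
```

Swapping the key function for `Levenshtein.ratio(query, w)` gives the same structure with the other library; the real question for a 900,000-word dictionary is how fast each ratio call is, not the lookup logic.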

NoDataDumpNoContribution
Maggie

2 Answers


In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

import codecs, difflib, Levenshtein, distance

# titles.tsv is tab-separated; columns 3 and 4 hold the two titles to compare
with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr    = row.lower().split("\t")

        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev   = Levenshtein.ratio(sr[3], sr[4])
        sor   = 1 - distance.sorensen(sr[3], sr[4])
        jac   = 1 - distance.jaccard(sr[3], sr[4])

        print(diffl, lev, sor, jac)

I then plotted the results with R:

[scatterplot: Levenshtein similarity vs. Difflib similarity for ~2.3 million title pairs]

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

library(ggplot2)
require(GGally)

difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")

ggpairs(difflib)

Result: [plot matrix (ggpairs) of the Difflib, Levenshtein, Sørensen, and Jaccard similarity values]

The Difflib / Levenshtein similarity really is quite interesting.

2018 edit: If you're working on identifying similar strings, you could also check out minhashing; there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
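The core minhash idea can be sketched in a few lines of pure Python: shingle each string into character n-grams, keep the minimum value under many salted hash functions, and estimate Jaccard similarity as the fraction of slots where two signatures agree. The strings, shingle size, and hash count below are illustrative choices, not anything from the answer above:

```python
import hashlib

def shingles(text, k=3):
    # character k-grams ("shingles") of the string
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # one minimum hash value per salted hash function
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # fraction of agreeing slots approximates true Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("clinical message normalization")
b = shingles("clinical message normalisation")
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
true = len(a & b) / len(a | b)
print(round(est, 2), round(true, 2))
```

The payoff is that signatures are fixed-size, so comparing (or bucketing, with locality-sensitive hashing) millions of strings no longer requires all-pairs edit-distance computations.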

duhaime
  • This is super cool! What is your take on this, then? Is Levenshtein just bad for title-length strings? – Ulf Aslak Nov 25 '15 at 15:42
  • It really depends what you're trying to capture in your similarity metric... – duhaime Nov 25 '15 at 18:03
  • I think some of the disagreement between difflib and Levenshtein may be explained by the autojunk heuristic used by difflib. What happens if you disable it? – Michael Jun 23 '16 at 14:55
  • That's a good question. The autojunk filter only takes effect if the number of observations is >200, so I'm not sure whether this particular dataset (book titles) would have been greatly affected, but it's worth investigating... – duhaime Jun 23 '16 at 22:29
  • Can you share the dataset with me? – Abhishek Gupta Sep 09 '16 at 02:02
  • Sorry Abhishek, the data I used is owned by a commercial vendor (ProQuest). The HathiTrust Research Center has an API you could use to obtain a similar data set. Really, any sufficiently large body of natural language would work (just partition it into strings roughly title length). – duhaime Sep 09 '16 at 02:05
  • @duhaime, thank you for this detailed analysis. I'm new to these kinds of plots and have no idea how to interpret them. What are the plots called, so that I may look them up and learn about them? – Zach Young Jul 06 '17 at 14:24
  • Looking at the first graph, it's comparing how differently the Levenshtein and difflib `ratio()` functions each report a difference. If both were equal, there would only be dots on the x=y slope, correct? But these plots show that Levenshtein reports any given difference as "more similar" than difflib: it stays above roughly .75 across the entire range of difflib? – Zach Young Jul 06 '17 at 14:46
  • @ZacharyYoung The first plot above is a scatterplot, and the second is sometimes called a "plot matrix". On your second note, I'd say the first plot suggests that Levenshtein similarity is always >= difflib similarity. – duhaime Jul 06 '17 at 15:03
  • @duhaime, right, Levenshtein similarity >= difflib. Thank you! – Zach Young Jul 06 '17 at 15:23
  • Why do you think "The Difflib / Levenshtein similarity really is quite interesting", and not difflib/Sørensen or difflib/Jaccard? – J. Doe Aug 23 '17 at 13:50
  • Could you give an example of using minhashing to compare two strings? – Saeedeh Aug 25 '19 at 12:13
  • @Saeedeh sure thing, check this out: https://stackoverflow.com/questions/25114338/approximate-string-matching-using-lsh/41792983#41792983 – duhaime Aug 25 '19 at 12:41
  • @duhaime - thank you for this. A couple of questions: 1) is there any ground-truth data to understand which performed better at fuzzy matching? 2) is there a shorter overview than the ~500-page book you linked there? – Joey Baruch Jan 05 '22 at 20:32
  • @JoeyBaruch this is kind of an open problem in NLP, to the best of my knowledge. There are some corpora for plagiarism detection, but the problem is that similarity is a continuous measure, so ground truth would require us to objectively assign continuous similarity to two strings, which is hard/impossible? I would read just Chapter 3 of the book above, which builds up to the minhashing algorithm. Or just read about minhashing, e.g. this great post: https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/ – duhaime Jan 06 '22 at 13:11
  • @duhaime, I guess it's just hard for me to look at those graphs and say something like `Levenshtein is better than difflib`. It's clear that one will produce higher scores, but that does not mean more accuracy. Is there any metric for accuracy available? – Joey Baruch Jan 06 '22 at 20:34
  • Yes, I agree, any notion of "better" here can only be discussed with respect to a particular goal. I think one can say that, generally, Levenshtein is faster than difflib, and difflib is fuzzier than Levenshtein. The only notions of accuracy here are specified exactly in the distance functions themselves, i.e. the Levenshtein and difflib.SequenceMatcher functions... – duhaime Jan 07 '22 at 17:20
  • difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes the doubled number of matching characters divided by the total number of characters in the two strings.

  • Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits needed to transform one string into the other.
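The Ratcliff/Obershelp formula can be verified directly against `difflib`'s own matching blocks, using the example strings from the question:

```python
import difflib

a, b = 'hello world', 'hello'
sm = difflib.SequenceMatcher(None, a, b)

# M = total size of all matching blocks, T = combined length of both strings
M = sum(block.size for block in sm.get_matching_blocks())
T = len(a) + len(b)

print(2 * M / T)   # 0.625
print(sm.ratio())  # 0.625, same value
```

Here the single matching block is 'hello' (size 5), so the ratio is 2*5/16 = 0.625.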

Complexity

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from here)

Levenshtein is O(m*n), where n and m are the length of the two input strings.
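The O(m*n) bound comes from the dynamic-programming table that the algorithm fills in; a minimal pure-Python sketch (the library's `ratio()` is then derived from this distance and the combined string length):

```python
def levenshtein(s, t):
    # classic DP: O(len(s) * len(t)) time, two rows of the table at a time
    m, n = len(s), len(t)
    prev = list(range(n + 1))              # row for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(levenshtein('hello world', 'hello'))  # 6 (delete ' world')
print(levenshtein('kitten', 'sitting'))     # 3
```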

Performance

According to the source code of the Levenshtein module: Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it's much faster.
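To check the speed claim on your own data, a rough micro-benchmark sketch (the word list is a made-up stand-in for dictionary entries, and python-Levenshtein is imported guardedly in case it isn't installed):

```python
import difflib
import timeit

# toy stand-in for ~1,000 dictionary lookups
words = ["haemorrhage", "hemorrage", "diabetis", "diabetes"] * 250
query = "hemorrhage"

def difflib_pass():
    return [difflib.SequenceMatcher(None, query, w).ratio() for w in words]

print("difflib:    ", timeit.timeit(difflib_pass, number=10))

try:
    import Levenshtein

    def lev_pass():
        return [Levenshtein.ratio(query, w) for w in words]

    print("Levenshtein:", timeit.timeit(lev_pass, number=10))
except ImportError:
    print("python-Levenshtein not installed; skipping")
```

For a 900,000-word dictionary per query, also consider pruning candidates first (e.g. by length or first letter) so that neither ratio has to run against every entry.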

marc_aragones
Ghassen Hamrouni
  • Thanks a lot for the info. I have added more details. Here it is: `I am doing clinical message normalization (spell check) in which I check each given word against a 900,000 word medical dictionary. I am more concerned about the time complexity/performance.` Do you think both perform alike in this case? – Maggie Jul 14 '11 at 09:29