Is there a standard way in Python to fuzzy match a string with arbitrary list of acceptable values?

Question

I am hoping for a function like this:

def findSimilar(string, options):
    ....
    return aString

Where aString is similar to the passed string but is present in options. I'm using this function to normalize user input from the toy application I'm working on. I read about using levenshtein distance, but I decided to ask here, as I'm hoping there is a simple solution in the Python standard libraries.

It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. — Martijn Pieters, Apr 18 '13 at 18:00
@MartijnPieters I am looking for a more "subjective" answer. Since I have absolutely no idea how to go about programming this answer. — Others, Apr 18 '13 at 18:04
For "similar", do you mean that the most characters in common(dg is like djgk), or the one that has the largest string in common (fghi is like abcdefg) — erdekhayser, Apr 18 '13 at 19:15

score 8 · Accepted Answer · answered Apr 18 '13 at 19:44

Use difflib.get_close_matches.

get_close_matches(word, possibilities[, n][, cutoff])

Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

score 4 · Answer 2 · answered Apr 18 '13 at 18:03

4

Calculate the Levenshtein distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

There are already python implementations, although I don't know about their quality...

answered Apr 18 '13 at 18:03

Benjamin

609
3
8

score 4 · Answer 3 · edited May 23 '17 at 12:17

4

I think you may want to take a look at this post. You just need a fuzzy string comparator.

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

edited May 23 '17 at 12:17

Community

1
1

answered Apr 18 '13 at 18:03

Jerry Meng

1,466
3
21
40

score 2 · Answer 4 · answered Mar 26 '17 at 14:14

2

I would suggest using from fuzzywuzzy Seat Geek. They have a fantastic function called process that does exactly what you are looking for from their website, but adapted to your question:

    string = "new york jets"
    options = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
    process.extract(string, options, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]

answered Mar 26 '17 at 14:14

Robert Minnich

31
1
6

1

Note that fuzzywuzzy uses Levenshtein Distance, as suggested. – nealmcb Nov 26 '17 at 05:39

score 0 · Answer 5 · answered Apr 18 '13 at 19:37

From the description of your question, you don't need any kind of string similarity, you just need to know if the input string is in the list. For that just use a set instead, and test to see if the string is in the set, like this:

def isStringAcceptable(string, set):
    return string in set

If you want to be tolerant to users inputting the wrong string, you need to decide what kind of errors you are going to tolerate. Using something like Levinshtein distance might be severe overkill for what you want, and it might give you funny results. If you just want to check for casing, then call string.lower() and make sure all of the strings in your set are lower case. You probably don't need something so fancy as a string similarity metric.

Is there a standard way in Python to fuzzy match a string with arbitrary list of acceptable values?

5 Answers5