4

I am hoping for a function like this:

def findSimilar(string, options):
    ....
    return aString

Where aString is similar to the passed string but is present in options. I'm using this function to normalize user input from the toy application I'm working on. I read about using levenshtein distance, but I decided to ask here, as I'm hoping there is a simple solution in the Python standard libraries.

Others
  • 2,876
  • 2
  • 30
  • 52
  • 5
    It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. – Martijn Pieters Apr 18 '13 at 18:00
  • @MartijnPieters I am looking for a more "subjective" answer. Since I have absolutely no idea how to go about programming this answer. – Others Apr 18 '13 at 18:04
  • For "similar", do you mean that the most characters in common(dg is like djgk), or the one that has the largest string in common (fghi is like abcdefg) – erdekhayser Apr 18 '13 at 19:15

5 Answers5

8

Use difflib.get_close_matches.

get_close_matches(word, possibilities[, n][, cutoff])

Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

shx2
  • 61,779
  • 13
  • 130
  • 153
4

Calculate the Levenshtein distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

There are already python implementations, although I don't know about their quality...

Benjamin
  • 609
  • 3
  • 8
4

I think you may want to take a look at this post. You just need a fuzzy string comparator.

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

Community
  • 1
  • 1
Jerry Meng
  • 1,466
  • 3
  • 21
  • 40
2

I would suggest using from fuzzywuzzy Seat Geek. They have a fantastic function called process that does exactly what you are looking for from their website, but adapted to your question:

    string = "new york jets"
    options = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
    process.extract(string, options, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
0

From the description of your question, you don't need any kind of string similarity, you just need to know if the input string is in the list. For that just use a set instead, and test to see if the string is in the set, like this:

def isStringAcceptable(string, set):
    return string in set

If you want to be tolerant to users inputting the wrong string, you need to decide what kind of errors you are going to tolerate. Using something like Levinshtein distance might be severe overkill for what you want, and it might give you funny results. If you just want to check for casing, then call string.lower() and make sure all of the strings in your set are lower case. You probably don't need something so fancy as a string similarity metric.

mattg
  • 1,731
  • 1
  • 12
  • 20