20

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My software receives free text inputs from a data source, and it should match those free texts to the pre-defined strings in the database (the ones above).

For example, if the software gets the string 'Alabama University', it should recognize that this is more similar to (1) than it is to (2).

At first, I thought of using a well-known string metric like Damerau-Levenshtein or trigrams, but this leads to unwanted results, as you can see here:

http://fuzzy-string.com/Compare/Transform.aspx?r=Levi+Watkins+Learning+Center+-+Alabama+State+University&q=Alabama+University

http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University

Difference to (1): 37
Difference to (2): 14

(2) wins because it is much shorter than (1), even though (1) contains both words (Alabama and University) of the search string.

I also tried trigrams (using the JavaScript library fuzzySet), but I got similar results there.

Is there a string metric that would recognize the similarity of the search string to (1)?

Jonas Sourlier
  • In the end, we used an extension of the trigram method. Instead of using only trigrams (`Ala, lab, aba, bam, ama` for `Alabama`), we also used the five-character equivalents (`Alaba, labam, abama`) and took a weighted average of the trigram distance and the five-character-gram distance. This approach was sufficient for our needs, but I'm still eager to know a better solution. – Jonas Sourlier Apr 25 '14 at 13:16
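A rough sketch of that combined n-gram idea (the Jaccard overlap, the helper names, and the equal 50/50 weighting are illustrative assumptions, not the exact method described in the comment):

def ngrams(text, n):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n):
    # set overlap (Jaccard) of the n-grams of both strings
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def combined_similarity(a, b, w3=0.5, w5=0.5):
    # weighted average of trigram and five-gram similarity
    return w3 * ngram_similarity(a, b, 3) + w5 * ngram_similarity(a, b, 5)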

6 Answers

6

You could try the Word Mover's Distance instead: https://github.com/mkusner/wmd. One big advantage of this algorithm is that it incorporates the meanings of words (via word embeddings) when computing the difference between documents, rather than comparing characters alone. The accompanying paper is referenced in that repository.
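A rough sketch of how this could look, assuming the gensim library (whose KeyedVectors expose a wmdistance method, with an optional POT/pyemd dependency) and a pretrained embedding; the model name and helper below are illustrative, not part of the linked repository:

import gensim.downloader as api

# Any pretrained word-embedding model works; this one is just an example.
model = api.load("glove-wiki-gigaword-50")

candidates = [
    'Levi Watkins Learning Center - Alabama State University',
    'ETH Library',
]

def tokens(text):
    # lowercase and keep alphabetic words only
    return [w.lower() for w in text.split() if w.isalpha()]

query = tokens('Alabama University')

# Lower Word Mover's Distance means the texts are semantically closer.
for candidate in candidates:
    print(candidate, model.wmdistance(query, tokens(candidate)))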

Sander De Dycker
neaGaze
4

You should change your approach:

Levenshtein distance is good at calculating similarity between sequences of units, whether those units are characters or words.

Conceptually, you should treat Alabama and University (two words) as two units and calculate the distance at the word level. At that level, the Levenshtein distance tells you how many word edits separate the search string from the database string - in (1), only one extra word sits between Alabama and University.

But you are applying a Levenshtein implementation that operates on characters within a word. Such an implementation only works well for matching single words, not whole sentences.

It is better to implement your own word-level Levenshtein algorithm on top (for example, backed by a BK-tree), and within each candidate match, compare the individual words again with character-level Levenshtein.

With that algorithm, your result for (1) should be a match with distance 1, and (2) should not match at all.
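A simplified sketch of the two-level idea (counting fuzzy word hits rather than a full word-level edit distance, and without the BK-tree index, which only matters for speed; the function names and the per-word threshold are my own):

def lev(a, b):
    # plain character-level Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def word_match_score(query, candidate, max_char_dist=2):
    # count query words that fuzzily occur somewhere in the candidate
    cand_words = candidate.lower().split()
    return sum(
        1 for qw in query.lower().split()
        if any(lev(qw, cw) <= max_char_dist for cw in cand_words)
    )

print(word_match_score("Alabama University",
                       "Levi Watkins Learning Center - Alabama State University"))  # 2
print(word_match_score("Alabama University", "ETH Library"))                        # 0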

4

I guess the answer is not needed anymore, but I liked the question, and it got me thinking about how to combine the advantages of regular expressions and the Levenshtein string metric while being less dependent on the raw distance.

So far I have come up with a parser that follows these premises and logic:

  • It uses Python 3 and the regex module (the OP didn't mention any language/module requirements)
  • Any needle that is searched for is stripped of its punctuation characters
  • Every haystack is also stripped of its punctuation characters, so N.A.S.A becomes NASA and matches a needle that was originally N.A.S.A. - I know this can be problematic in some scenarios, but given the premises I couldn't come up with a better solution.
  • Every word within the needle that is not at least 3 characters long is removed (no need for is, on, at, no, etc.)
  • Matching is case-insensitive
  • The needle is split into word groups of n items each; n is computed by the lambda in the groups dict whose key k is the largest one satisfying 0 < k <= l, where l is the needle's word count
  • The words within a word group must appear in order, with a configurable maximum number of other words between them (maximum_word_distance)
  • Each word, depending on its length, can have a different allowed error threshold: the total errors, substitutions, insertions, and deletions can each be limited, again via the tolerances dict entry whose key k is the largest one satisfying 0 < k <= the word's length
  • Both of these dicts map keys to lambdas, so the first/last entries can compute their values from the length instead of using constants

Online demo here

contextual_fuzzy_matcher.py:

from collections import OrderedDict
import regex


class ContextualFuzzyMatcher(object):
    maximum_word_distance = 2
    word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)
    punctuation = regex.compile(r"[\u2000-\u206F\u2E00-\u2E7F\\'!\"#$%&\(\)\*\+,\-\.\/:;<=>\?@\[\]\^_`\{\|\}~]")
    groups = OrderedDict((
        (0, lambda l: l),
        (4, lambda l: 3),
        (8, lambda l: 6),
        (10, lambda l: int(l // 0.75)),  # int: the group size is used as a slice length
    ))
    tolerances = OrderedDict((
        (0, {
            'e': lambda l: 0,
            's': lambda l: 0,
            'i': lambda l: 0,
            'd': lambda l: 0,
        }),
        (3, {
            'e': lambda l: 1,
            's': lambda l: 1,
            'i': lambda l: 1,
            'd': lambda l: 1,
        }),
        (6, {
            'e': lambda l: 2,
            's': lambda l: 1,
            'i': lambda l: 1,
            'd': lambda l: 1,
        }),
        (9, {
            'e': lambda l: 3,
            's': lambda l: 2,
            'i': lambda l: 2,
            'd': lambda l: 2,
        }),
        (12, {
            'e': lambda l: l // 4,
            's': lambda l: l // 6,
            'i': lambda l: l // 6,
            'd': lambda l: l // 6,
        }),
    ))

    def __init__(self, needle):
        self.sentence = needle
        self.words = self.sentence_to_words(self.sentence)
        self.words_len = len(self.words)
        self.group_size = self.get_group_size()
        self.word_groups = self.get_word_groups()
        self.regexp = self.get_regexp()

    def sentence_to_words(self, sentence):
        sentence = regex.sub(self.punctuation, "", sentence)
        sentence = regex.sub(" +", " ", sentence)
        return [word for word in sentence.split(' ') if len(word) > 2]

    def get_group_size(self):
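        # Select the last group-size lambda whose threshold key is <= the needle's word count.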
        return list(value for key, value in self.groups.items() if self.words_len >= key)[-1](self.words_len)

    def get_word_groups(self):
        return [self.words[i:i + self.group_size] for i in range(self.words_len - self.group_size + 1)]

    def get_tolerance(self, word_len):
        return list(value for key, value in self.tolerances.items() if word_len >= key)[-1]

    def get_regexp(self):
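        # Build one alternation of fuzzy patterns: each word group becomes a sequence of
        # fuzzy word patterns (with per-word error tolerances) joined by the allowed word gap.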
        combinations = []
        for word_group in self.word_groups:
            distants = []
            for word in word_group:
                word_len = len(word)
                tolerance = self.get_tolerance(word_len)
                distants.append(r"({}){{e<={},s<={},i<={},d<={}}}".format(
                    word,
                    tolerance['e'](word_len),
                    tolerance['s'](word_len),
                    tolerance['i'](word_len),
                    tolerance['d'](word_len),
                ))
            combinations.append(
                self.word_distance.join(distants)
            )
        return regex.compile(
            r"|".join(combinations),
            regex.MULTILINE | regex.IGNORECASE
        )

    def findall(self, haystack):
        return self.regexp.findall(haystack)

main.py:

test_sentences = [
    'Levi Watkins Learning Center - Alabama State University',
    'ETH Library'
]
test_texts = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sapien eget mi proin sed libero enim sed. Nec tincidunt praesent semper feugiat nibh sed pulvinar. Habitasse platea dictumst quisque sagittis. Tortor condimentum lacinia quis vel eros donec ac odio. Platea dictumst vestibulum rhoncus est pellentesque elit ullamcorper dignissim. Ultricies tristique nulla aliquet enim tortor at. Mi proin sed libero enim sed faucibus. Fames ac turpis egestas integer eget aliquet nibh. Potenti nullam ac tortor vitae purus faucibus ornare suspendisse. Cras semper auctor neque vitae tempus quam pellentesque nec. Quam lacus suspendisse faucibus interdum posuere. Neque laoreet suspendisse interdum consectetur libero id faucibus nisl tincidunt. Viverra tellus in hac habitasse. Nibh nisl condimentum id venenatis a condimentum vitae. Tincidunt dui ut ornare lectus."
    "Mattis aliquam faucibus purus in massa tempor nec feugiat nisl. Amet consectetur adipiscing elit ut aliquam purus. Turpis massa tincidunt dui ut ornare. Suscipit tellus mauris a diam maecenas sed enim ut sem. Id consectetur purus ut faucibus pulvinar elementum. Est velit egestas dui id. Felis imperdiet proin fermentum leo. Faucibus nisl tincidunt eget nullam non nisi est sit. Elit pellentesque habitant morbi tristique. Nisi lacus sed viverra tellus. Morbi tristique senectus et netus et malesuada fames. Id diam vel quam elementum pulvinar. Id nibh tortor id aliquet lectus. Sem integer vitae justo eget magna. Quisque sagittis purus sit amet volutpat consequat. Auctor elit sed vulputate mi sit amet. Venenatis lectus magna fringilla urna porttitor rhoncus dolor purus. Adipiscing diam donec adipiscing tristique risus nec feugiat in fermentum. Bibendum est ultricies integer quis."
    "Interdum posuere lorem ipsum dolor sit. Convallis convallis tellus id interdum velit. Sollicitudin aliquam ultrices sagittis orci a scelerisque purus. Vel quam elementum pulvinar etiam. Adipiscing bibendum est ultricies integer quis. Tellus molestie nunc non blandit. Sit amet porttitor eget dolor morbi non arcu. Scelerisque purus semper eget duis at tellus. Diam maecenas sed enim ut sem viverra. Vulputate odio ut enim blandit volutpat maecenas. Faucibus purus in massa tempor nec. Bibendum ut tristique et egestas quis ipsum suspendisse. Ut aliquam purus sit amet luctus venenatis lectus magna. Ac placerat vestibulum lectus mauris ultrices eros in cursus turpis. Feugiat pretium nibh ipsum consequat nisl vel pretium. Elit pellentesque habitant morbi tristique senectus et.",
    "Found at ETH's own Library", # ' will be a problem - it adds one extra deletion
    "State University of Alabama has a learning center called Levi Watkins",
    "The ETH library is not to be confused with Alabama State university's Levi Watkins Learning center",
    "ETH Library",
    "Alabma State Unversity",
    "Levi Wtkins Learning"
]


for test_sentence in test_sentences:
    parser = ContextualFuzzyMatcher(test_sentence)
    for test_text in test_texts:
        for match in parser.findall(test_text):
            print(match)

returns:

('', '', '', '', '', '', '', '', '', '', '', '', ' Alabama', 'State', 'university')
(' Levi', 'Watkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', '', '', '', '', '', 'Alabma', 'State', 'Unversity')
('Levi', 'Wtkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
(' ETH', 'library')
('ETH', 'Library')

I am fully aware that this is far from a perfect solution and that my examples were few and not really representative - but maybe by tweaking the configuration and doing a lot of real-world tests, it can cover quite a lot of cases without generating too many false positives. Also, since it is class-based, it can be inherited and configured differently for different sources - maybe in scientific texts a maximum word distance of 1 is sufficient, while newspaper articles might need 3, and so on.

wiesion
2

First, your distance score needs to be adjusted based on the length of the database entry and/or input. A distance of 5 against an expression of 10 characters is much worse than a distance of 5 against an expression of 100 characters.

But the main problem with your approach is that plain Levenshtein is not a substring matching algorithm. It compares all of one string with all of another string. Your big distance in case (1) is due to the large number of words in the database expression that are not in the input expression.

To get around that, you are better off using an algorithm that can match substrings, such as fuzzy Bitap or Smith–Waterman.

If you have to use Levenshtein or similar you probably want to use it to compare words to words and then generate some score based on the number of matching words and the quality of the matches.
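For the substring route, here is a rough character-level Smith–Waterman scoring sketch (the match/mismatch/gap weights and the function name are illustrative choices, not taken from any particular library):

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # local alignment score: high when some substring of a aligns well with some substring of b
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

query = "alabama university"
print(smith_waterman_score(query, "levi watkins learning center - alabama state university"))  # high
print(smith_waterman_score(query, "eth library"))                                              # low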

rghome
1

You can try to use a normalized Levenshtein distance:

Li Yujian, Liu Bo, "A Normalized Levenshtein Distance Metric," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091-1095, June 2007, doi:10.1109/TPAMI.2007.1078 http://www.computer.org/csdl/trans/tp/2007/06/i1091-abs.html

They propose normalizing the Levenshtein distance. With such a normalization, a difference of one character between sequences of length two weighs more than the same difference between sequences of length 10.
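As a rough illustration of the length-scaling idea, here is a very simple normalization (raw distance divided by the longer string's length). Note this is only an illustration, not the exact metric defined in the paper, and it assumes the python-Levenshtein package:

import Levenshtein

def normalized_distance(a, b):
    # scale the raw edit distance into [0, 1] by the longer string's length
    return Levenshtein.distance(a, b) / max(len(a), len(b), 1)

print(normalized_distance("Alabama University",
                          "Levi Watkins Learning Center - Alabama State University"))
print(normalized_distance("Alabama University", "ETH Library"))

Using the distances from the question (37 and 14), the normalized scores come out to roughly 37/55 ≈ 0.67 for (1) and 14/18 ≈ 0.78 for (2), so (1) already ranks closer after normalization.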

Matthias Studer
-1

Keyword Counting

You haven't really defined why you think option one is a "closer" match, at least not in any algorithmic sense. It seems like you're basing your expectations on the notion that option one has more matching keywords than option two, so why not just match based on the number of keywords in each string?

For example, using Ruby 2.0:

string1 = 'Levi Watkins Learning Center - Alabama State University'
string2 = 'ETH Library'
strings = [string1, string2]

keywords  = 'Alabama University'.split
keycount  = {}

# Count matching keywords in each string.
strings.each do |str|
  keyword_hits  = Hash.new(0)
  keywords.each { |word| keyword_hits[word] += str.scan(/#{word}/).count }
  keyword_count = keyword_hits.values.reduce :+
  keycount[str] =  keyword_count
end

# Sort by keyword count, and print results.
keycount.sort_by { |_string, count| count }.reverse.map { |e| pp "#{e.last}: #{e.first}" }

This will print:

"2: Levi Watkins Learning Center - Alabama State University"
"0: ETH Library"

which matches your expectations of the corpus. You might want to make additional passes on the results using other algorithms to refine the results or to break ties, but this should at least get you pointed in the right direction.

Todd A. Jacobs
  • Thank you. The problem with your approach is that I'm losing the biggest advantage of the Levenshtein algorithm: **detecting the similarity of words**. Keyword Counting, as you're suggesting it, wouldn't detect the similarity of misspelled words (for example `University` and `Univresity`). – Jonas Sourlier Nov 23 '13 at 14:27
  • @cheeesus Your entire approach, as originally posted, is not looking at words but at whole strings. You have also not defined any meaningful metric for determining similarity according to whatever definitions you want. If you want a better answer, you will need to improve your question. The answer I posted works for your corpus; if you want other results, please post a different corpus and different sample output. – Todd A. Jacobs Dec 04 '13 at 15:45