
I have been using the Levenshtein distance algorithm for a project I am working on, but some of the results aren't what I would choose. For example:

When "Date" is compared to

  1. "State" it has a lev distance of 2
  2. "Today's Date" it has a lev distance of 9

This is what we would expect from the algorithm, of course, but I'm curious whether anyone knows of an approach that gives a closer match when the compared string contains the source string ("Date") exactly. In other words, "Today's Date" should get a higher ranking because it has "Date" in it.
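
To illustrate the kind of ranking I'm after, here is a rough C# sketch (only a sketch: the 0.5 containment bonus is an arbitrary choice and `ScoreCandidate` is a made-up helper, not an existing algorithm or API):

```csharp
using System;

static class FuzzyRank
{
    // Standard dynamic-programming Levenshtein distance.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Lower score = better match. A candidate that contains the source
    // string verbatim gets its Levenshtein score halved (arbitrary weighting).
    static double ScoreCandidate(string source, string candidate)
    {
        double score = Levenshtein(source, candidate);
        if (candidate.IndexOf(source, StringComparison.OrdinalIgnoreCase) >= 0)
            score *= 0.5;
        return score;
    }

    static void Main()
    {
        Console.WriteLine(ScoreCandidate("Date", "State"));        // 2
        Console.WriteLine(ScoreCandidate("Date", "Today's Date")); // 4 (8 * 0.5)
    }
}
```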

Bonus points if you can find a .NET library that implements this.

Abe Miessler
  • What kind of Bonus points we talking? – The Scrum Meister Feb 13 '11 at 16:25
  • @Scrum, I'll come wash your car. – Abe Miessler Feb 13 '11 at 16:28
  • @Lie, I thought I did. Did you read the question or just the title? Take a look at the second to last paragraph for an explanation on what I mean by "better" – Abe Miessler Feb 13 '11 at 16:28
  • You could simply word-match the string and modify its ranking according to the results. Also check this question: http://stackoverflow.com/questions/1510593/jaro-winkler-distance-algorithm-in-net – Jaroslav Jandek Feb 13 '11 at 16:35
  • @Abe Miessler: giving one example is not sufficient; you should clearly define why one string should be considered a closer match than another, or, if you do not know that, explain what problem you are trying to solve. Levenshtein edit distance gives you the minimum number of single-character transformations (insert, delete, and change) required to transform one string into another. It is "best" in one sense, but obviously is not what you're looking for. – Lie Ryan Feb 13 '11 at 16:37
  • You could for instance take the minimum of the Levenshtein distances of your word against multiple synonyms. For instance {"date","today's date","date of today", ...}. It could potentially be a lot of work to define synonyms if you need to match many different words, though. – rettvest Feb 13 '11 at 17:40

3 Answers


You are probably looking for the Longest common subsequence?
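
For the example in the question, a minimal longest-common-subsequence length sketch in C# (`LcsLength` is just an illustrative name) would rank "Today's Date" above "State" for the source "Date":

```csharp
using System;

static class Lcs
{
    // Length of the longest common subsequence of a and b.
    static int LcsLength(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = a[i - 1] == b[j - 1]
                    ? d[i - 1, j - 1] + 1
                    : Math.Max(d[i - 1, j], d[i, j - 1]);
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        Console.WriteLine(LcsLength("Date", "State"));        // 3 ("ate")
        Console.WriteLine(LcsLength("Date", "Today's Date")); // 4 ("Date")
    }
}
```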

Lie Ryan

I think you're meant to tokenize the strings before applying Levenshtein. As an alternative, there is the Jaro-Winkler distance too.

There's a .NET library, SimMetrics, which seems to cover a few alternatives.
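
As a rough illustration of the tokenised idea (this is only a sketch, not SimMetrics' API; splitting on whitespace and taking the lowest per-word Levenshtein distance is one possible policy among many):

```csharp
using System;
using System.Linq;

static class TokenMatch
{
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Split the candidate into words and keep the best (lowest) distance
    // to the source word - effectively a fuzzy "contains" check.
    static int BestTokenDistance(string source, string candidate) =>
        candidate.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                 .Min(token => Levenshtein(source, token));

    static void Main()
    {
        Console.WriteLine(BestTokenDistance("Date", "State"));        // 2
        Console.WriteLine(BestTokenDistance("Date", "Today's Date")); // 0
    }
}
```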

Johan Sjöberg
  • Interesting, so you're saying compare each individual word in the compared value? Would I then use the lowest score from any of the words in the compared value? – Abe Miessler Feb 13 '11 at 16:30
  • @Abe - it would depend on the problem whether comparing each token and selecting the lowest match is appropriate. Essentially doing this amounts to a fuzzy "contains" search. I do this when comparing building names for campus buildings, where it seems natural that a person would enter just the main part of the name but I would want to match based on the full name. What I do is tokenize both the source and target strings, find the lowest distance for each source token against each target token, and sum them (there's a rough sketch of this after these comments). – tvanfosson Feb 13 '11 at 16:35
  • It depends on your application: if you want to use it to provide spelling suggestions for one or more words, I think you should compare them individually; the same goes if you wish to see the likelihood of word `a` being in sentence `b`. But as I said, it depends on what you need the distance for. – Johan Sjöberg Feb 13 '11 at 16:35
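
A rough sketch of the tokenise-and-sum approach described in the comment above (the building names and the `TokenSumDistance` helper are made-up illustrations, not code from any library):

```csharp
using System;
using System.Linq;

static class CampusNameMatch
{
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Tokenize both strings, then for each source token take its best
    // (lowest) distance to any target token, and sum those minima.
    static int TokenSumDistance(string source, string target)
    {
        var split = new[] { ' ' };
        var sourceTokens = source.Split(split, StringSplitOptions.RemoveEmptyEntries);
        var targetTokens = target.Split(split, StringSplitOptions.RemoveEmptyEntries);
        return sourceTokens.Sum(s => targetTokens.Min(t => Levenshtein(s, t)));
    }

    static void Main()
    {
        // Hypothetical building-name examples.
        Console.WriteLine(TokenSumDistance("Engineering", "Engineering Hall"));      // 0
        Console.WriteLine(TokenSumDistance("Enginering Hall", "Engineering Hall"));  // 1
    }
}
```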

To do this properly you need some context for how it will be used.

If you are trying to do an address lookup, then "Nosuch STREET" might be a perfect match for "Nosuch ROAD"; or in a no-fly list you want all 20 spellings of Gadaffi to match.

If you are trying to analyse how much a piece of historic text has changed through copying, then you need a different algorithm.

Martin Beckett