
I have been using the Levenshtein distance algorithm for a project I am working on, but some of the results aren't what I would choose. For example:

When "Date" is compared to

  1. "State" it has a lev distance of 2
  2. "Today's Date" it has a lev distance of 9

This is what we would expect from the algorithm, of course, but I'm curious whether anyone knows of an approach that gives a closer match when the compared string contains the source string ("Date") exactly. In other words, "Today's Date" should get a higher ranking because it has "Date" in it.
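
To illustrate the kind of ranking I'm after, here is a rough C# sketch (only a sketch: the 0.5 containment bonus is an arbitrary choice and `ScoreCandidate` is a made-up helper, not an existing algorithm or API):

```csharp
using System;

static class FuzzyRank
{
    // Standard dynamic-programming Levenshtein distance.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Lower score = better match. A candidate that contains the source
    // string verbatim gets its Levenshtein score halved (arbitrary weighting).
    static double ScoreCandidate(string source, string candidate)
    {
        double score = Levenshtein(source, candidate);
        if (candidate.IndexOf(source, StringComparison.OrdinalIgnoreCase) >= 0)
            score *= 0.5;
        return score;
    }

    static void Main()
    {
        Console.WriteLine(ScoreCandidate("Date", "State"));        // 2
        Console.WriteLine(ScoreCandidate("Date", "Today's Date")); // 4 (8 * 0.5)
    }
}
```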

Bonus points if you can find a .NET library that implements this.

Abe Miessler
  • What kind of Bonus points we talking? – The Scrum Meister Feb 13 '11 at 16:25
  • @Scrum, I'll come wash your car. – Abe Miessler Feb 13 '11 at 16:28
  • @Lie, I thought I did. Did you read the question or just the title? Take a look at the second to last paragraph for an explanation on what I mean by "better" – Abe Miessler Feb 13 '11 at 16:28
  • You could simply word-match the string and modify its ranking according to the results. Also check this question: http://stackoverflow.com/questions/1510593/jaro-winkler-distance-algorithm-in-net – Jaroslav Jandek Feb 13 '11 at 16:35
  • @Abe Miessler: giving one example is not sufficient; you should clearly define why one string should be considered a closer match than another, or, if you do not know that, explain what problem you are trying to solve. Levenshtein edit distance gives you the minimum number of single-character transformations (insert, delete, and change) required to transform one string into another. It is "best" in one sense, but obviously is not what you're looking for. – Lie Ryan Feb 13 '11 at 16:37
  • You could for instance take the minimum of the Levenshtein distances of your word against multiple synonyms. For instance {"date","today's date","date of today", ...}. It could potentially be a lot of work to define synonyms if you need to match many different words, though. – rettvest Feb 13 '11 at 17:40

3 Answers


You are probably looking for the Longest common subsequence?
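
For the example in the question, a minimal longest-common-subsequence length sketch in C# (`LcsLength` is just an illustrative name) would rank "Today's Date" above "State" for the source "Date":

```csharp
using System;

static class Lcs
{
    // Length of the longest common subsequence of a and b.
    static int LcsLength(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = a[i - 1] == b[j - 1]
                    ? d[i - 1, j - 1] + 1
                    : Math.Max(d[i - 1, j], d[i, j - 1]);
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        Console.WriteLine(LcsLength("Date", "State"));        // 3 ("ate")
        Console.WriteLine(LcsLength("Date", "Today's Date")); // 4 ("Date")
    }
}
```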

Lie Ryan

I think you're meant to tokenize the strings before applying Levenshtein. As an alternative, there is the Jaro-Winkler distance too.

There's a .NET library, SimMetrics, which seems to cover a few alternatives.
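
As a rough illustration of the tokenised idea (this is only a sketch, not SimMetrics' API; splitting on whitespace and taking the lowest per-word Levenshtein distance is one possible policy among many):

```csharp
using System;
using System.Linq;

static class TokenMatch
{
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Split the candidate into words and keep the best (lowest) distance
    // to the source word - effectively a fuzzy "contains" check.
    static int BestTokenDistance(string source, string candidate) =>
        candidate.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                 .Min(token => Levenshtein(source, token));

    static void Main()
    {
        Console.WriteLine(BestTokenDistance("Date", "State"));        // 2
        Console.WriteLine(BestTokenDistance("Date", "Today's Date")); // 0
    }
}
```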

Johan Sjöberg
  • Interesting, so you're saying compare each individual word in the compared value? Would I then use the lowest score from any of the words in the compared value? – Abe Miessler Feb 13 '11 at 16:30
  • @Abe - it would depend on the problem whether comparing each token and selecting the lowest match is appropriate. Essentially doing this amounts to a fuzzy "contains" search. I do this when comparing building names for campus buildings, where it seems natural that a person would enter just the main part of the name but I would want to match based on the full name. What I do is tokenize both the source and target strings, find the lowest distance for each source token against each target token, and sum them (there's a rough sketch of this after these comments). – tvanfosson Feb 13 '11 at 16:35
  • It depends on your application: if you want to use it to provide spelling suggestions for one or more words, I think you should compare them individually; the same goes if you wish to see the likelihood of word `a` being in sentence `b`. But as I said, it depends on what you need the distance for. – Johan Sjöberg Feb 13 '11 at 16:35
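
A rough sketch of the tokenise-and-sum approach described in the comment above (the building names and the `TokenSumDistance` helper are made-up illustrations, not code from any library):

```csharp
using System;
using System.Linq;

static class CampusNameMatch
{
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Tokenize both strings, then for each source token take its best
    // (lowest) distance to any target token, and sum those minima.
    static int TokenSumDistance(string source, string target)
    {
        var split = new[] { ' ' };
        var sourceTokens = source.Split(split, StringSplitOptions.RemoveEmptyEntries);
        var targetTokens = target.Split(split, StringSplitOptions.RemoveEmptyEntries);
        return sourceTokens.Sum(s => targetTokens.Min(t => Levenshtein(s, t)));
    }

    static void Main()
    {
        // Hypothetical building-name examples.
        Console.WriteLine(TokenSumDistance("Engineering", "Engineering Hall"));      // 0
        Console.WriteLine(TokenSumDistance("Enginering Hall", "Engineering Hall"));  // 1
    }
}
```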

To do this properly you need some context for how it will be used.

If you are trying to do an address lookup, then "Nosuch STREET" might be a perfect match for "Nosuch ROAD"; or in a no-fly list you want all 20 spellings of Gadaffi to match.

If you are trying to analyse how much a piece of historic text has changed through copying, then you need a different algorithm.

Martin Beckett