String likeness algorithms

Question

I have two strings (they're going to be descriptions in a simple database eventually), let's say they're

String A: "Apple orange coconut lime jimmy buffet"
String B: "Car bicycle skateboard"

What I'm looking for is this. I want a function that will have the input "cocnut", and have the output be "String A"

We could have differences in capitalization, and the spelling won't always be spot on. The goal is a 'quick and dirty' search if you will.

Are there any .net (or third party), or recommend 'likeness algorithms' for strings, so I could check that the input has a 'pretty close fragment' and return it? My database is going to have liek 50 entries, tops.

I'm trying the levenshtein algorithms right now. I guess I'm looking for a recommendation because my goal is to only use fragments of the entire string. Trying all of them and picking the best is probably what I should go for. — greggorob64, Mar 08 '13 at 20:57
http://stackoverflow.com/a/1095806/3043 ... link in that answer is dead, get the product here: http://sourceforge.net/projects/simmetrics/files/ — Joel Coehoorn, Mar 08 '13 at 20:58
@KonradRudolph: Looks like they have an (old) .NET version there: http://sourceforge.net/projects/simmetrics/files/simmetrics.NET%20and%20amp_%20phonetics.NET/ — Chris Sinclair, Mar 08 '13 at 21:11

score 12 · Accepted Answer · edited May 23 '17 at 12:34

12

What you’re searching for is known as the edit distance between two strings. There exist plenty of implementations – here’s one from Stack Overflow itself.

Since you’re searching for only part of a string what you want is a locally optimal match rather than a global match as computed by this method.

This is known as the local alignment problem and once again it’s easily solvable by an almost identical algorithm – the only thing that changes is the initialisation (we don’t penalise whatever comes before the search string) and the selection of the optimum value (we don’t penalise whatever comes after the search string).

edited May 23 '17 at 12:34

Community

1
1

answered Mar 08 '13 at 20:57

Konrad Rudolph

530,221
131
937
1,214

I think I figured out my solution, I'm going to use the levenshtien algorithm. Since most of my data is simple and space seperated, i'll just compare my string to a space-seperated version of the database entries, and take the highest words as the result. – greggorob64 Mar 08 '13 at 21:33

String likeness algorithms

1 Answers1