I've been using the [Simmetrics][1] Java library with good success for comparing two Strings. But there seem to be two approaches, and I need a combination of both for my scenario.
Currently I am using CosineSimilarity (I do use some simplifiers, but have omitted them here to keep the code simple):
StringMetric metric = with(new CosineSimilarity<String>())
.tokenize(Tokenizers.whitespace()).build();
score = metric.compare(string1, string2);
This works quite well, except that when there is a simple misspelling I would have expected a higher score than I get.
E.g. comparing "mony honey" and "money honey" only returns 0.5 (scores go from 0.0 to 1.0, with 1.0 being a perfect match); I would have expected higher.
With Levenshtein it returns a better 0.9090909.
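For reference, both scores can be reproduced by hand. The following is a minimal plain-Java sketch, not Simmetrics code: `ScoreCheck` and its methods are names I made up, and it assumes that the cosine score is computed over whitespace token counts and that the Levenshtein score is 1 minus the edit distance divided by the length of the longer input. Under those assumptions it gives the same numbers as above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, written only to illustrate where the two scores come from.
final class ScoreCheck {

    // Cosine similarity over whitespace-separated tokens, treated as multisets.
    static double cosine(String a, String b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            return ca.isEmpty() && cb.isEmpty() ? 1.0 : 0.0;
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        return dot / (norm(ca) * norm(cb));
    }

    // Counts how often each token occurs in the input.
    static Map<String, Integer> counts(String s) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                m.merge(t, 1, Integer::sum);
            }
        }
        return m;
    }

    // Euclidean length of the token-count vector.
    static double norm(Map<String, Integer> m) {
        double sum = 0;
        for (int c : m.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Edit distance scaled to 0..1 by the length of the longer input.
    static double levenshtein(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - distance(a, b) / (double) max;
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // One shared token out of two on each side gives about 0.5.
        System.out.println(cosine("mony honey", "money honey"));
        // One edit over 11 characters gives about 0.909.
        System.out.println(levenshtein("mony honey", "money honey"));
    }
}
```

So the cosine score is low because "mony" and "money" are simply different tokens, while the Levenshtein score is high because only one character out of eleven differs.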
But one thing I noted when reading the documentation is that CosineSimilarity is a MultiSet metric, and that the whitespace() tokenizer is actually required to break the input into parts, whereas a StringMetric such as Levenshtein does not need one:
StringMetric metric = with(new Levenshtein())
.build();
This implies to me that Levenshtein doesn't treat whitespace specially, which is an issue as I want to match words and essentially ignore whitespace and word order.
So, for example, CosineSimilarity returns 1.0 when comparing "honey trap" and "trap honey", but Levenshtein returns 0.0, which is no good for me.
What I ideally want is for word order not to matter, and for individual words to count as a good match when they differ only slightly, e.g. money/mony.
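To make the requirement concrete, here is a rough plain-Java sketch of the behaviour I'm after. It is not Simmetrics code and not necessarily the right algorithm: `TokenFuzzyMatch` and its methods are hypothetical names, and the scoring rule (match each word to its most similar word on the other side by normalised Levenshtein, average, then average both directions) is just one assumption about how the two ideas could be combined.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of order-insensitive, typo-tolerant word matching.
final class TokenFuzzyMatch {

    // Symmetric score: average of the two one-directional scores.
    static double score(String a, String b) {
        List<String> ta = tokens(a);
        List<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return ta.isEmpty() && tb.isEmpty() ? 1.0 : 0.0;
        }
        return (directed(ta, tb) + directed(tb, ta)) / 2.0;
    }

    // For every word in 'from', take its best match in 'to' and average those.
    static double directed(List<String> from, List<String> to) {
        double sum = 0;
        for (String f : from) {
            double best = 0;
            for (String t : to) {
                best = Math.max(best, wordSimilarity(f, t));
            }
            sum += best;
        }
        return sum / from.size();
    }

    // Splits on whitespace and drops empty tokens.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Levenshtein distance scaled to 0..1 by the longer word's length.
    static double wordSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - distance(a, b) / (double) max;
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Same words in a different order: every word has an exact match.
        System.out.println(score("honey trap", "trap honey"));
        // One misspelt word: comes out at roughly 0.9 with this rule.
        System.out.println(score("mony honey", "money honey"));
    }
}
```

With this rule, "honey trap" vs "trap honey" scores 1.0 and "mony honey" vs "money honey" scores about 0.9, which is the kind of result I would like to get from the library.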
The Strings can be in any language, but are most usually in English. They are song titles, so they are usually less than ten words long, typically about 5 words.
Does Simmetrics offer another algorithm that can provide both of these properties?
There are simplifiers such as RefinedSoundex that could be applied to the input, but because the text may not be in English I don't think that would work very well.
What do you think would be the best algorithm to use?