
I've been using the [Simmetrics][1] Java library with good success for comparing two Strings. But there seem to be two approaches, and I need a combination of both for my scenario.

Currently I am using CosineSimilarity (I do use some simplifiers, but have omitted them here to keep the code simple):

    StringMetric metric = with(new CosineSimilarity<String>())
            .tokenize(Tokenizers.whitespace())
            .build();
    score = metric.compare(string1, string2);

and this works quite well, except that when there is a simple misspelling I would have expected a higher score than I get.

e.g. comparing `mony honey` and `money honey` only returns 0.5 (scores go from 0.0 to 1.0, with 1.0 being a perfect match); I would have expected higher.
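To see where the 0.5 comes from: after whitespace tokenization only `honey` is shared, so the cosine is 1 / (√2 · √2) = 0.5. A rough sketch of the arithmetic in plain Java (this is my own illustration of multiset cosine, not the Simmetrics implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    // Count whitespace-separated tokens, as a whitespace tokenizer would produce them.
    static Map<String, Integer> tokenCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : s.split("\\s+")) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Multiset cosine similarity: dot(a, b) / (|a| * |b|).
    static double cosine(String s1, String s2) {
        Map<String, Integer> a = tokenCounts(s1), b = tokenCounts(s2);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Only "honey" is shared: 1 / (sqrt(2) * sqrt(2)) = 0.5
        System.out.println(cosine("mony honey", "money honey")); // 0.5
    }
}
```

Because only whole-token overlap counts, the single misspelled word contributes nothing, which is why the score drops so sharply.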

With Levenshtein it returns a better 0.9090909.
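That 0.9090909 falls out of the usual normalisation: one edit (inserting the `e`) over a longest length of 11 gives 1 − 1/11. A minimal sketch, assuming a distance-over-max-length normalisation (not Simmetrics' own code):

```java
public class LevenshteinSketch {

    // Classic dynamic-programming edit distance over two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Normalised to 0..1: 1 - distance / max length.
    static float similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0f;
        return 1.0f - (float) distance(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        // One insertion ('e') over 11 characters: 1 - 1/11 = 0.9090909...
        System.out.println(similarity("mony honey", "money honey"));
    }
}
```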

But one thing I noted reading the documentation is that CosineSimilarity is a MultiSet metric, so the whitespace() tokenizer is actually required to break the input into parts, whereas a StringMetric such as Levenshtein does not need one:

    StringMetric metric = with(new Levenshtein())
            .build();

This then implies to me that Levenshtein doesn't treat whitespace specially, which is an issue, as I want to match words and essentially ignore whitespace and word order.

So, for example, CosineSimilarity returns 1.0 when comparing `honey trap` and `trap honey`, but Levenshtein returns 0.0; that is no good for me.

What I ideally want is for word order not to matter, and for individual words to still be a good match when there are just slight variations in a word, e.g. money/mony.

The Strings can be in any language, but are most usually English. They are song titles, so are usually less than ten words long, typically about five words.

Does Simmetrics offer another algorithm that can provide both of these?

There are simplifiers such as RefinedSoundex that could be applied to the input, but because the language may not be English I don't think that would work very well.

What do you think would be the best algorithm to use?

Paul Taylor

1 Answer


Simmetrics contains metrics for comparing Strings, Lists, Sets and Multisets.

The Levenshtein distance between two strings is the minimum number of single-character edits needed to turn one into the other. Whitespace is a character too, so a difference in whitespace causes a difference in similarity.

Cosine similarity is the similarity between two vectors (which for convenience are represented as multisets). So without some form of tokenization, cosine similarity is simply not suited to comparing strings.

Depending on how you split the string, you may end up comparing entirely different things. If you split the string on whitespace, you compare documents by their similarity in word usage. If you split the string into n-grams, you compare strings by their letter groups, which tends to work well against typos.

For your particular use case you may want to look into tokenizing on whitespace and then tokenizing into q-grams. Then try CosineSimilarity, Tanimoto, Dice, SimonWhite or Jaccard.

E.g:

/**
 * Tokenizers can also be chained.
 * 
 * `chilperic ii son of childeric ii`
 * 
 * By splitting on whitespace is tokenized into:
 * 
 * `[chilperic, ii, son, of, childeric, ii]`
 * 
 * After using a q-gram with a q of 2:
 * 
 * `[ch,hi,il,lp,pe,er,ri,ic, ii, so,on, of, ch,hi,il,ld,de,er,ri,ic, ii]`
 * 
 */
public static float example04() {

    String a = "A quirky thing it is. This is a sentence.";
    String b = "This sentence is similar; a quirky thing it is.";

    StringMetric metric = 
            with(new CosineSimilarity<String>())
            .tokenize(Tokenizers.whitespace())
            .tokenize(Tokenizers.qGram(3))
            .build();

    return metric.compare(a, b); // 0.8292
}

To decide, you can take a number of representative queries and compare the results by their precision and recall. That will give you a sound basis for choosing a metric.
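As a sketch of what that evaluation could look like (the pair type, metric signature and threshold here are illustrative, not part of the Simmetrics API):

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

public class EvaluateMetric {

    // A labelled pair: two strings plus whether a human judged them a match.
    record LabelledPair(String a, String b, boolean isMatch) {}

    // Precision = true positives / predicted positives,
    // recall    = true positives / actual positives.
    static float[] precisionRecall(List<LabelledPair> pairs,
                                   ToDoubleBiFunction<String, String> metric,
                                   double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (LabelledPair p : pairs) {
            boolean predicted = metric.applyAsDouble(p.a(), p.b()) >= threshold;
            if (predicted && p.isMatch()) tp++;
            else if (predicted && !p.isMatch()) fp++;
            else if (!predicted && p.isMatch()) fn++;
        }
        float precision = tp + fp == 0 ? 0 : (float) tp / (tp + fp);
        float recall = tp + fn == 0 ? 0 : (float) tp / (tp + fn);
        return new float[] { precision, recall };
    }
}
```

Run this over each candidate metric (Cosine, Dice, SimonWhite, ...) and a range of thresholds; the setup with the best precision/recall trade-off on your own song titles is the one to keep.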

Full disclosure: I'm the current maintainer of the Simmetrics project.

M.P. Korstanje
  • So I do use various simplifiers, but CosineSimilarity basically just works out what proportion of tokens in Set1 are in Set2. Does it consider order at all (I assume not)? What I can't grasp, even after reading the javadoc, is how it would differ from using Simon White, Dice, etcetera. – Paul Taylor Nov 23 '16 at 11:04
  • I will give ngrams a go. I think the comments in your example code are incorrect, but I assume applying them after the whitespace tokenizer means we are comparing pairs of letters and ignoring any ' ' in them. With Lucene it understood words; using the whitespace tokenizer here keeps words, but is there no way to use Levenshtein on a word basis? Using a StringMetric doesn't allow tokenization, which is counter-intuitive to me, since a StringMetric would seem to be the obvious way to compare two strings. – Paul Taylor Nov 23 '16 at 11:15
  • Cosine similarity does not consider order, nor do any of the other *set metrics. The difference between Dice<->SimonWhite and between Tanimoto<->CosineSimilarity is that SimonWhite and Cosine do not ignore duplicated elements. This follows from their nature as multiset metrics. – M.P. Korstanje Nov 23 '16 at 19:58
  • Yes. Tokenizing on whitespace and then taking n-grams ensures that tokenizing "Hello World" does not contain "o " or " W" tokens. – M.P. Korstanje Nov 23 '16 at 20:01
  • Unlike various other similarity libraries, I've chosen to open up the algorithms to the fundamental data type that they operate on. If you look at the implementation of cosine similarity in Lucene, you'll find that tokenization is done for you. I did not want to limit the users of Simmetrics to Strings alone. However, this does require the user to have some understanding of how different metrics behave in their domain and what they should operate on, or no informed choice can be made. Hence I generally advise people to look at the precision and recall of their chosen setup. – M.P. Korstanje Nov 23 '16 at 20:18
  • Precision/recall seems to be more of a Lucene-type thing, i.e. searching thousands of docs for some words, whereas Simmetrics is just for comparing two things to each other (usually Strings), so I don't really see how it is relevant. What I'm struggling with for these different metrics is what they are good for. Okay, so neither SimonWhite nor CosineSimilarity ignores duplicates, but how are they different? The code shows me the algorithm, but it doesn't explain the purpose of the difference. – Paul Taylor Nov 24 '16 at 09:37
  • For most metrics the justification is empirical (it works) rather than theoretical. The explanation you're looking for may not exist and will be domain-specific. This makes it ill suited for use in technical documentation. I would recommend checking the linked wiki pages and the sources on the wiki. – M.P. Korstanje Nov 24 '16 at 09:52
  • Hmm, okay, thanks for your help. I've marked your answer correct as it helps. – Paul Taylor Nov 24 '16 at 10:14
  • However, I would argue that most of the logic I'm discussing is not that domain-specific, or at least not a very specific domain; i.e. the logic I'm discussing is 'comparing two short sentences, allowing for typos and differences in word order'. – Paul Taylor Nov 24 '16 at 10:30
  • I don't agree with the idea that this is not very specific. But if you're willing to write a short guide on how to do just that I'll accept it as a pull request. – M.P. Korstanje Nov 24 '16 at 19:27
  • FYI, I found whitespace/ngram2 didn't work so well with short sentences, because sometimes it found quite high similarity between completely different sentences that contained words which happened to have similar letters in them, and of course the order of the pairs has no effect. whitespace/ngram3 filters out most of these false matches and is working quite well. – Paul Taylor Dec 01 '16 at 09:30