
I have the following task.

I'm given a list of strings like so:

        var strings = [
            'Steve jobs created the iPod when he was at Apple',
            'I really like the new Macbook by Apple',
            'Jony Ive was concerned being fired by Steve Jobs after his return to Apple',
            'The new Macbook has just one USB-C type connector',
            'I like bananas',
            'The brezels I can buy in my local store are much better than the ones in the supermarket',
            'the',
            'foo',
            'Steve'
        ];

I now want to compare each string with every other string, and for each comparison find out how similar the two are on a scale from 0 to 1 (or 0% to 100%).

So, I googled around a little and found this: Similarity String Comparison in Java

So, I followed the instructions there and ported the method similarity(String s1, String s2) to JavaScript:

        function similarity(s1, s2) {
            var longer = s1;
            var shorter = s2;
            if (s1.length < s2.length) {
                longer = s2;
                shorter = s1;
            }
            var longerLength = longer.length;
            if (longerLength == 0) {
                return 1.0;
            }
            return (longerLength - longer.LevenshteinDistance(shorter)) / longerLength;
        }

As the comparison algorithm I used the Levenshtein distance:

        String.prototype.LevenshteinDistance = function (s2) {
            // dynamic programming table of size (this.length + 1) x (s2.length + 1)
            var array = new Array(this.length + 1);
            for (var i = 0; i < this.length + 1; i++)
                array[i] = new Array(s2.length + 1);

            // base cases: the distance from/to an empty prefix equals the prefix length
            for (var i = 0; i < this.length + 1; i++)
                array[i][0] = i;
            for (var j = 0; j < s2.length + 1; j++)
                array[0][j] = j;

            for (var i = 1; i < this.length + 1; i++) {
                for (var j = 1; j < s2.length + 1; j++) {
                    // equal characters cost nothing; otherwise take the cheapest of
                    // insertion, deletion and substitution (each costs 1)
                    if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
                    else {
                        array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                        array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
                    }
                }
            }
            return array[this.length][s2.length];
        };

So, as a test, I ran a full loop comparing each string with every other string and printing the result like this:

            for (var i in strings){
                var s = strings[i];
                print('Checking string: "' + s + '"');
                for (var j in strings){
                    print('-----');
                    var s2 = strings[j];
                    print('vs "' + s2 + '"');
                    var sim = similarity(s, s2);
                    print('Similarity: ' + Math.round(sim*100) + '%');
                }
                print('<br>////// NEXT /////////////////////////////////////////////////<br>');
            }

Ok, now this is the result: https://jsfiddle.net/wxksfa4w/

Now, looking at the results, I get some good matches but also non-trivial scores for strings that are completely unrelated to each other, for example:

"Steve jobs created the iPod when he was at Apple" and "I like bananas" match for 13%?

"Steve jobs created the iPod when he was at Apple" and just "Steve" match for just 10% although the exact same word "Steve" is used as in the first sentence?

How can I get better semantic results? Is Levenshtein the wrong algorithm? From what I understand, Levenshtein counts the number of edit steps needed to turn sentence 1 into sentence 2. So the length of a string seems to have a major impact on the result, even if there is semantic similarity.

Any advice?

Timo Ernst
  • With http://en.wikipedia.org/wiki/Levenshtein_distance description, I gotta believe there is an error in your use of the algorithm. I'd rather see you use a traditional definition of the function (with two (four) inputs) than what you've done with the global array. Read the wiki closely... – zipzit Mar 29 '15 at 02:32
  • How set are you on doing this client-side? A javascript request along with PHP's `similar_text` may fit the bill. – John Smith Mar 29 '15 at 02:35
  • @JohnSmith Nothing wrong with that. After checking how it works, though, I fear that similar_text will give results similar to Levenshtein's. – Timo Ernst Mar 29 '15 at 10:01

3 Answers


You should probably be using the presence of the same words in both sentences as a strong hint of similarity. A simple way of doing it is to treat each sentence as a bag of words and weight the words with tf-idf.
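As a minimal sketch of that idea in plain JavaScript (the naive tokenizer, the idf formula and the cosine comparison below are just one reasonable set of choices, not a fixed recipe):

        function tokenize(s) {
            // very naive tokenizer: lowercase and split on anything that is not a letter or digit
            return s.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) { return t.length > 0; });
        }

        function tfIdfVectors(sentences) {
            var docs = sentences.map(tokenize);
            var df = {};                                    // document frequency per term
            docs.forEach(function (tokens) {
                var seen = {};
                tokens.forEach(function (t) { seen[t] = true; });
                Object.keys(seen).forEach(function (t) { df[t] = (df[t] || 0) + 1; });
            });
            return docs.map(function (tokens) {
                var vec = {};
                tokens.forEach(function (t) { vec[t] = (vec[t] || 0) + 1; });   // raw term frequency
                Object.keys(vec).forEach(function (t) {
                    // tf * idf: words that occur in many sentences (like "the") get a low weight
                    vec[t] = vec[t] * Math.log(docs.length / df[t]);
                });
                return vec;
            });
        }

        function cosineSimilarity(a, b) {
            var dot = 0, normA = 0, normB = 0;
            Object.keys(a).forEach(function (t) {
                normA += a[t] * a[t];
                if (b[t]) dot += a[t] * b[t];
            });
            Object.keys(b).forEach(function (t) { normB += b[t] * b[t]; });
            if (normA === 0 || normB === 0) return 0;
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));   // 0 = no shared weighted words, 1 = identical bag
        }

With the strings array from the question, `var vectors = tfIdfVectors(strings)` followed by `cosineSimilarity(vectors[0], vectors[8])` ("Steve jobs created the iPod..." vs "Steve") is driven up by the shared word, while "I like bananas" shares no words with the first sentence and drops to 0.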

kar288
  • Agreed. Levenshtein distance is intended to compare words, not sentences. You might get nice (and also very weird) results by using Levenshtein to compare tokens once you have tokenized your sentences, but for that first step you'll need a regex-based or more complicated tokenizer. – Touffy Mar 29 '15 at 06:55
  • kar288 and @Touffy: Ok, if I change to word-based comparison: What if I had s1="I'm going home" and s2="I went home"? The words "went" and "going" are completely different in terms of characterwise comparison but their meaning is similar. Thus I'd want a higher similarity rating value. Same with "I eat" and "I am eating". – Timo Ernst Mar 29 '15 at 09:56
  • Then you need to delve deeper into NLP territory: you'll have to lemmatize the words (using a dictionary + morphology to replace words with their base form). – Touffy Mar 29 '15 at 12:08
  • Yeah, like Touffy says, you should do some more NLP. Something basic that you can do and could help a lot would be POS tagging. There are a lot of libraries out there. I've only done this in Java with the Stanford libraries but maybe you can find a js version ( https://www.npmjs.com/package/pos ) – kar288 Mar 31 '15 at 03:45
  • @kar288 So, I could use a pos tagger like this https://code.google.com/p/jspos/ and get a tagged version of my sentences. Then I apply word-based comparison to this instead of sentence-based comparison on the result that jspos returns? – Timo Ernst Apr 21 '15 at 20:01
  • @Touffy Can you recommend a better word-based comparison algorithm thats superior to Levenshtein and suits my needs better? – Timo Ernst Apr 21 '15 at 20:02
  • @Timo I can recommend plenty, but I don't know what your needs and requirements are. You can work with tokens alone, or with processed tokens (e.g. stems, lemmas, word vectors…), you may be interested in their order or not, you can try to parse the grammatical structure of the sentence and then use tree comparison algorithms… the most basic solution is what kar288 answered. – Touffy Apr 21 '15 at 20:10
  • @Touffy So following kar288's suggestion, I would use a pos tagger like this code.google.com/p/jspos and get a tagged version of my sentences. Then I apply word-based comparison to this instead of sentence-based comparison on the result that jspos returns? – Timo Ernst Apr 21 '15 at 20:31
  • Sorry, I meant his first suggestion (his answer, not his comment). Parts Of Speech (POS) are grammatical features that may or may not be of interest to you. It's the tokenization that you need (turning a string into a list of words). – Touffy Apr 21 '15 at 20:40
  • (but POS would help with lemmatization or stemming, if you want to do that) – Touffy Apr 21 '15 at 20:41
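To make the tokenize-then-compare idea from this comment thread concrete, here is a rough sketch that reuses the question's similarity() function on individual words (the whitespace/punctuation tokenizer and the best-match averaging are assumptions of mine, not a standard algorithm):

        function wordLevelSimilarity(a, b) {
            // tokenize both sentences; \W+ splits on anything that is not a word character
            var tokensA = a.toLowerCase().split(/\W+/).filter(Boolean);
            var tokensB = b.toLowerCase().split(/\W+/).filter(Boolean);
            if (tokensA.length > tokensB.length) {              // make tokensA the shorter list
                var tmp = tokensA; tokensA = tokensB; tokensB = tmp;
            }
            if (tokensA.length === 0) return 0;
            var total = 0;
            tokensA.forEach(function (wordA) {
                var best = 0;
                tokensB.forEach(function (wordB) {
                    // character-level Levenshtein similarity per word pair (the question's function)
                    best = Math.max(best, similarity(wordA, wordB));
                });
                total += best;
            });
            return total / tokensA.length;                      // average best match per word
        }

Shared words like "Steve" now count as exact matches, but "going" vs "went" still scores low on a character level, which is exactly where the lemmatization/stemming mentioned above would come in.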

What you could use is normalized Longest Common Subsequence (LCS) similarity: you compute the length of the longest common subsequence, then you divide by the length of the shorter string.

By the way, the longest common subsequence should not be confused with the longest common substring: for the two strings "This is a long string" and "This is another string, really..."

The longest common subsequence is "This is a string"
The longest common substring is "This is a"

And the relative LCS similarity is 16/21 = 0.76

You can find a Java implementation of LCS similarity here: https://github.com/tdebatty/java-string-similarity

And a Javascript implementation is available on wikibooks: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_subsequence#JavaScript
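For completeness, a minimal JavaScript sketch of normalized LCS similarity as described above (the function names are mine, and empty strings are simply treated as having no similarity):

        // length of the longest common subsequence, via dynamic programming
        function lcsLength(s1, s2) {
            var dp = [];
            for (var i = 0; i <= s1.length; i++) {
                dp[i] = [];
                for (var j = 0; j <= s2.length; j++) {
                    if (i === 0 || j === 0) dp[i][j] = 0;
                    else if (s1[i - 1] === s2[j - 1]) dp[i][j] = dp[i - 1][j - 1] + 1;
                    else dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
            return dp[s1.length][s2.length];
        }

        // normalized LCS similarity: LCS length divided by the length of the shorter string
        function lcsSimilarity(s1, s2) {
            var shorter = Math.min(s1.length, s2.length);
            if (shorter === 0) return 0;
            return lcsLength(s1, s2) / shorter;
        }

For the question's examples, `lcsSimilarity('Steve jobs created the iPod when he was at Apple', 'Steve')` returns 1.0, because "Steve" is fully contained in the longer sentence.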


SimMetrics has Java code for the Smith-Waterman-Gotoh algorithm, which is great for comparing sentences. I've found Smith-Waterman-Gotoh to be the superior algorithm for comparing larger strings such as sentences and article titles.
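In case you want to stay in JavaScript rather than pull in the Java library, here is a minimal sketch of Smith-Waterman with Gotoh's affine gap handling (the scoring parameters and the normalization by the best possible score of the shorter string are my own choices, not necessarily SimMetrics' defaults):

        function smithWatermanGotoh(s1, s2, match, mismatch, gapOpen, gapExtend) {
            match = match || 1; mismatch = mismatch || -1;
            gapOpen = gapOpen || 1; gapExtend = gapExtend || 0.5;
            var n = s1.length, m = s2.length;
            if (n === 0 || m === 0) return 0;
            // H = best local alignment score ending at (i, j),
            // E = best score with a gap in s1, F = best score with a gap in s2
            var H = [], E = [], F = [], best = 0;
            for (var i = 0; i <= n; i++) {
                H[i] = []; E[i] = []; F[i] = [];
                for (var j = 0; j <= m; j++) { H[i][j] = 0; E[i][j] = 0; F[i][j] = 0; }
            }
            for (var i = 1; i <= n; i++) {
                for (var j = 1; j <= m; j++) {
                    // opening a gap costs gapOpen, extending an existing gap costs gapExtend
                    E[i][j] = Math.max(H[i][j - 1] - gapOpen, E[i][j - 1] - gapExtend);
                    F[i][j] = Math.max(H[i - 1][j] - gapOpen, F[i - 1][j] - gapExtend);
                    var sub = s1[i - 1] === s2[j - 1] ? match : mismatch;
                    H[i][j] = Math.max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j]);
                    if (H[i][j] > best) best = H[i][j];
                }
            }
            // normalize by the best achievable score: a perfect local match of the shorter string
            return best / (match * Math.min(n, m));
        }

Because the alignment is local, `smithWatermanGotoh('Steve jobs created the iPod when he was at Apple', 'Steve')` returns 1.0: the word "Steve" aligns perfectly inside the longer sentence, and the rest of the sentence is not penalized.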

joshweir