Similarity String Comparison in Java

Question

I want to compare several strings to each other, and find the ones that are the most similar. I was wondering if there is any library, method or best practice that would return me which strings are more similar to other strings. For example:

"The quick fox jumped" -> "The fox jumped"
"The quick fox jumped" -> "The fox"

This comparison would return that the first is more similar than the second.

I guess I need some method such as:

double similarityIndex(String s1, String s2)

Is there such a thing somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when the values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries in the legacy system so I can get the generated keys. It has drawbacks, as the results still have to be checked manually, but it would save a lot of work.

score 190 · Answer 1 · edited Nov 25 '17 at 07:10

The common way of calculating the similarity between two strings in a 0%-100% fashion, as used in many libraries, is to measure how much (in %) you'd have to change the longer string to turn it into the shorter:

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

Computing the `editDistance()`:

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several ways to implement this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm, which we'll use in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options for calculating the edit distance:

Use Apache Commons Text's implementation of Levenshtein distance: LevenshteinDistance.apply(CharSequence left, CharSequence right)
Implement it on your own. Below you'll find an example implementation.

Working example:

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
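For the use case in the question (matching an abbreviated description against a list of full ones), the score can be used to pick the closest candidate. What follows is only a minimal sketch: the class name `BestMatch` and the method `mostSimilar` are hypothetical and not part of the code above, and the distance is a plain two-row Levenshtein that ignores case, as the example above does.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not from the answer above): picks the candidate
// with the highest normalized Levenshtein similarity to a query string.
class BestMatch {

  // Similarity in [0, 1]: 1 - distance / length of the longer string.
  static double similarity(String a, String b) {
    int longest = Math.max(a.length(), b.length());
    if (longest == 0) {
      return 1.0; // both strings are empty
    }
    return (longest - levenshtein(a, b)) / (double) longest;
  }

  // Case-insensitive Levenshtein distance using two rolling rows.
  static int levenshtein(String a, String b) {
    a = a.toLowerCase(Locale.ROOT);
    b = b.toLowerCase(Locale.ROOT);
    int[] previous = new int[b.length() + 1];
    int[] current = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      previous[j] = j; // distance from the empty prefix of a
    }
    for (int i = 1; i <= a.length(); i++) {
      current[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        int deletion = previous[j] + 1;
        int insertion = current[j - 1] + 1;
        current[j] = Math.min(substitution, Math.min(deletion, insertion));
      }
      int[] swap = previous;
      previous = current;
      current = swap;
    }
    return previous[b.length()];
  }

  // Returns the most similar candidate, or null if the list is empty.
  static String mostSimilar(String query, List<String> candidates) {
    String best = null;
    double bestScore = -1.0;
    for (String candidate : candidates) {
      double score = similarity(query, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }
}
```

With the two candidates from the question, "The fox jumped" and "The fox", `mostSimilar("The quick fox jumped", ...)` returns "The fox jumped". Since the results still have to be checked manually, it may be more useful in practice to return the top few candidates with their scores instead of a single winner.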

Levenshtein distance method is available in `org.apache.commons.lang3.StringUtils`. — Cleankod, Dec 05 '14 at 08:55
@Cleankod Now it is part of commons-text: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/LevenshteinDistance.html — Luiz, Nov 13 '19 at 13:41

score 91 · Accepted Answer · edited Jan 08 '20 at 13:24

Yes, there are many well-documented algorithms, such as:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
and others

A good summary ("Sam's String Metrics") can be found here (original link dead, so it links to Internet Archive)
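As an illustration of one entry from the list, here is a rough sketch of Jaccard similarity in Java. It assumes the strings are compared as sets of lower-cased, whitespace-separated words; that tokenization and the class name `JaccardSimilarity` are choices made for this example, not something the answer prescribes.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: Jaccard similarity over lower-cased word sets.
class JaccardSimilarity {

  // Splits on whitespace and lower-cases (an assumed tokenization).
  static Set<String> tokens(String text) {
    Set<String> result = new HashSet<>();
    for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!word.isEmpty()) {
        result.add(word);
      }
    }
    return result;
  }

  // Size of the intersection divided by the size of the union;
  // two inputs without any words count as identical (1.0).
  static double jaccard(String s1, String s2) {
    Set<String> a = tokens(s1);
    Set<String> b = tokens(s2);
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0;
    }
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int unionSize = a.size() + b.size() - intersection.size();
    return intersection.size() / (double) unionSize;
  }
}
```

For the strings in the question this gives 3/4 = 0.75 for "The fox jumped" and 2/4 = 0.5 for "The fox", so the ordering matches what the asker expects. Word-level sets ignore word order and repeated words, which may or may not suit abbreviated descriptions.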

Also check these projects:

answered Jun 05 '09 at 09:59 by dfa · edited Jan 08 '20 at 13:24 by xeruf
+1 The simmetrics site doesn't seem active anymore. However, I found the code on sourceforge: http://sourceforge.net/projects/simmetrics/ Thanks for the pointer. – Michael Merchant Dec 22 '11 at 21:06
The "you can check this" link is broken. – Kiril Mar 18 '14 at 14:39
That's why Michael Merchant posted the correct link above. – emilyk Sep 24 '14 at 20:24
The jar for simmetrics on sourceforge is a bit outdated; https://github.com/mpkorstanje/simmetrics is the updated GitHub page with Maven artifacts. – tom91136 Apr 11 '15 at 03:32
To add to @MichaelMerchant's comment, the project is also available on [github](https://github.com/Simmetrics/simmetrics). It is not very active there either, but it is a bit more recent than sourceforge. – Ghurdyl Dec 12 '18 at 11:26

score 16 · Answer 3 · edited Jul 12 '13 at 03:31

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

score 14 · Answer 4 · answered Aug 07 '15 at 11:26

There are indeed a lot of string similarity measures out there:

Levenshtein edit distance;
Damerau-Levenshtein distance;
Jaro-Winkler similarity;
Longest Common Subsequence edit distance;
Q-Gram (Ukkonen);
n-Gram distance (Kondrak);
Jaccard index;
Sorensen-Dice coefficient;
Cosine similarity;
...

You can find an explanation and a Java implementation of each of these here: https://github.com/tdebatty/java-string-similarity
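To show how one of these differs from plain Levenshtein, here is a sketch of the optimal string alignment distance, the restricted variant of Damerau-Levenshtein that also counts a swap of two adjacent characters as a single edit. It is written from the textbook recurrence and is not taken from the linked library; the class name `OsaDistance` is made up for this example.

```java
// Illustrative sketch: optimal string alignment distance, the restricted
// form of Damerau-Levenshtein (no substring is edited more than once).
class OsaDistance {

  static int distance(String s, String t) {
    int n = s.length();
    int m = t.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) {
      d[i][0] = i; // delete the first i characters of s
    }
    for (int j = 0; j <= m; j++) {
      d[0][j] = j; // insert the first j characters of t
    }
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
        int best = Math.min(
            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
            d[i - 1][j - 1] + cost);
        // Adjacent transposition, e.g. "89" <-> "98", costs one edit.
        if (i > 1 && j > 1
            && s.charAt(i - 1) == t.charAt(j - 2)
            && s.charAt(i - 2) == t.charAt(j - 1)) {
          best = Math.min(best, d[i - 2][j - 2] + 1);
        }
        d[i][j] = best;
      }
    }
    return d[n][m];
  }
}
```

For "1234567890" and "1234567980" this returns 1, whereas plain Levenshtein counts the swapped digits as 2 edits. This version keeps the full matrix for readability, so it uses memory proportional to the product of the two lengths.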

score 12 · Answer 5 · edited Oct 19 '22 at 11:52

You can achieve this using the Apache Commons Text library. Take a look at these two classes within it:

LevenshteinDistance
FuzzyScore

Deprecated versions of the above, from the Apache Commons Lang library:

StringUtils.getLevenshteinDistance
StringUtils.getFuzzyDistance

answered Apr 10 '17 at 21:17 by noelicus

As of October 2017, the linked methods are deprecated. Use the classes [LevenshteinDistance](https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/LevenshteinDistance.html) and [FuzzyScore](https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/FuzzyScore.html) from the [commons text library](https://commons.apache.org/proper/commons-text/) instead. – vatbub Oct 09 '17 at 21:04

score 11 · Answer 6 · answered Jun 05 '09 at 09:58

You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance

answered Jun 05 '09 at 09:58 by Florian Fankhauser

Levenshtein is great for a few strings, but will not scale to comparisons between a large number of strings. – spender Jun 05 '09 at 10:00
I've used Levenshtein in Java with some success. I haven't done comparisons over huge lists, so there may be a performance hit. Also, it's a bit simple and could use some tweaking to raise the threshold for shorter words (3 or 4 characters), which tend to be seen as more similar than they should be (it's only 3 edits from cat to dog). Note that the edit distances suggested below are pretty much the same thing: Levenshtein is one particular edit distance. – Rhubarb Jun 05 '09 at 10:27
Here's an article showing how to combine Levenshtein with an efficient SQL query: http://literatejava.com/sql/fuzzy-string-search-sql/ – Thomas W Apr 26 '14 at 02:11
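On the scaling concern raised in the comments, one cheap trick is a length pre-filter. The Levenshtein distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ by more than the allowed distance can be skipped without running the full computation. The sketch below assumes a fixed maximum distance; `LengthPrefilter` and `withinDistance` are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: skip the full Levenshtein computation when the
// length difference alone already exceeds the allowed distance.
class LengthPrefilter {

  // Single-row Levenshtein distance (case-sensitive).
  static int levenshtein(String a, String b) {
    int[] row = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      row[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      int diagonal = row[0]; // value from the previous row, column j - 1
      row[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int above = row[j]; // value from the previous row, column j
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        row[j] = Math.min(Math.min(row[j - 1] + 1, above + 1), diagonal + cost);
        diagonal = above;
      }
    }
    return row[b.length()];
  }

  // Returns the candidates within maxDistance edits of query.
  static List<String> withinDistance(String query, List<String> candidates, int maxDistance) {
    List<String> matches = new ArrayList<>();
    for (String candidate : candidates) {
      // The distance can never be smaller than the length difference.
      if (Math.abs(query.length() - candidate.length()) > maxDistance) {
        continue;
      }
      if (levenshtein(query, candidate) <= maxDistance) {
        matches.add(candidate);
      }
    }
    return matches;
  }
}
```

This only removes obviously hopeless pairs. Comparing every string with every other one is still quadratic in the number of strings, so very large lists probably need an index or a database-side approach like the one in the linked article.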

score 5 · Answer 7 · answered Oct 18 '14 at 13:09

Thanks to the first answerer. I think that code ended up computing the edit distance, computeEditDistance(s1, s2), twice. Because that computation is expensive, I changed the code to compute it only once:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // ensure s1 is the longer string
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }
}
}

score 3 · Answer 8 · answered Jun 05 '09 at 09:59

Theoretically, you can compare edit distances.

answered Jun 05 '09 at 09:59 by Anton Gogolev

score 3 · Answer 9 · answered Jun 05 '09 at 10:00

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

answered Jun 05 '09 at 10:00 by Laurence Gonsalves

score 3 · Answer 10 · answered Jun 05 '09 at 10:01

Sounds like a plagiarism finder to me if your string turns into a document. Maybe searching with that term will turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.

score 0 · Answer 11 · answered Aug 14 '22 at 17:21

You can use this "Levenshtein Distance" algorithm without any library:

public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    int n = s.length();
    int m = t.length();

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    // p[i] holds the distance for the first i characters of s
    final int[] p = new int[n + 1];
    int upperLeft;
    int upper;

    for (int i = 0; i <= n; i++) {
        p[i] = i;
    }

    for (int j = 1; j <= m; j++) {
        upperLeft = p[0];
        final char tj = t.charAt(j - 1); // jth character of t
        p[0] = j;

        for (int i = 1; i <= n; i++) {
            upper = p[i];
            final int cost = s.charAt(i - 1) == tj ? 0 : 1;
            // minimum of cell to the left + 1, to the top + 1, diagonally left and up + cost
            p[i] = Math.min(Math.min(p[i - 1] + 1, p[i] + 1), upperLeft + cost);
            upperLeft = upper;
        }
    }

    return p[n];
}

From Here

score -1 · Answer 12 · answered May 10 '20 at 09:11

You can also use the Z algorithm to find similarity within a string. See https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
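For reference, here is a sketch of the standard Z-function in Java. Note that it measures something different from the edit-distance answers: for each position it gives the length of the longest common prefix of the string and the suffix starting there, which, as far as I understand the linked challenge, is what gets summed. The class and method names are made up for this example.

```java
// Illustrative sketch of the Z algorithm. z[i] is the length of the longest
// common prefix of s and the suffix of s starting at index i.
class ZAlgorithm {

  static int[] zFunction(String s) {
    int n = s.length();
    int[] z = new int[n];
    if (n == 0) {
      return z;
    }
    z[0] = n; // the whole string matches itself
    int left = 0;
    int right = 0; // [left, right) is the rightmost window known to match a prefix
    for (int i = 1; i < n; i++) {
      if (i < right) {
        // Reuse the value computed for the mirrored position in the prefix.
        z[i] = Math.min(right - i, z[i - left]);
      }
      // Extend the match character by character.
      while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
        z[i]++;
      }
      if (i + z[i] > right) {
        left = i;
        right = i + z[i];
      }
    }
    return z;
  }

  // Sum of all z-values: total overlap between s and each of its suffixes.
  static long suffixSimilaritySum(String s) {
    long total = 0;
    for (int value : zFunction(s)) {
      total += value;
    }
    return total;
  }
}
```

For "ababaa" the z-values are 6, 0, 3, 0, 1, 1, which sum to 11. To compare two different strings this way you would have to concatenate them with a separator, and even then it only finds prefix matches, so it is not a drop-in replacement for an edit distance.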

answered May 10 '20 at 09:11 by Athul Samuel
