0

I've done some research and found topics on checking whether a string is a substring of another string, and on choosing the closest string to a specified string. But how would I check whether one string is similar to another and return a true/false response? For example:

String 1: JAVA IS A PROGRAMMING LANGUAGE
String 2: JAVA IS A PROGRAMMING LANGUAG X

This would return a "true"

String 1: JAVA IS A PROGRAMMING LANGUAGE
String 2: I ATE THE CAKE

This would return "false"

Thanks.

Sophia Zhuang
  • You need to learn basics of java. – afzalex Oct 19 '14 at 01:32
  • 3
    Please refer to this article: http://stackoverflow.com/questions/955110/similarity-string-comparison-in-java – bhugo313 Oct 19 '14 at 01:33
  • @HugoBauer: Consider flagging this question as a duplicate of that one (if you think it is). – Jeffrey Bosboom Oct 19 '14 at 01:34
  • There's no general way to provide a true/false result here -- you'll have to use one of the edit distance metrics and set a threshold below which you consider two strings to be similar. – Jeffrey Bosboom Oct 19 '14 at 01:35
  • 3
    @afzalex: This doesn't have anything to do with equals() -- the poster explicitly wants to know if strings are _similar_, not _equal_. – Jeffrey Bosboom Oct 19 '14 at 01:36
  • I was mainly thinking of providing a verification level, an integer from 1 to 10 or so: the lower it is, the stricter the verification, and the higher it is, the more lenient. – Sophia Zhuang Oct 19 '14 at 01:50

3 Answers

5

What you're asking is non-trivial. The core of your answer is another question:

How do you define "similar"?

You need to specify some rules governing this, and some threshold associated with the rules, which I'm not sure you have even thought about yet.

For instance, below is a simple solution. (Go easy on me on prettiness and efficiency; I threw this together quickly, so the code may be a bit messy. I was more concerned with answering the question, and you can refactor it yourself if you want.) It uses a threshold of 75%: I count the characters the two strings have in common, regardless of position, and check whether that count is more than 75% of the length of the longer string. (Note: java.lang.String is final, so you cannot extend it, which is why the class wraps a String instead.)

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MyString{
  private static final float THRESHOLD = (float) 0.75;

  private final Logger logger = Logger.getLogger(MyString.class.getName());

  private String str;
  private Map <Character, Integer> strMap;

  public MyString(String str){ //java.lang.String is final...
    this.str = str;
    this.strMap = this.generateCharMap(str);
  }

  public void executeTestForSophiaZhuang(){
    {
      MyString str1 = new MyString("JAVA IS A PROGRAMMING LANGUAGE");
      String str2 = "JAVA IS A PROGRAMMING LANGUAG X";
      logger.log(Level.INFO, "String {0}.isSimilar({1}) == {2}", new Object[]{
        str1.toString(), str2, str1.isSimilar(str2)});
    }
    {
      MyString str1 = new MyString("JAVA IS A PROGRAMMING LANGUAG X");
      String str2 = "JAVA IS A PROGRAMMING LANGUAGE";
      logger.log(Level.INFO, "String {0}.isSimilar({1}) == {2}", new Object[]{
        str1.toString(), str2, str1.isSimilar(str2)});
    }
    {
      MyString str1 = new MyString("JAVA IS A PROGRAMMING LANGUAGE");
      String str2 = "I ATE THE CAKE";
      logger.log(Level.INFO, "String {0}.isSimilar({1}) == {2}", new Object[]{
        str1.toString(), str2, str1.isSimilar(str2)});
    }
    {
      MyString str1 = new MyString("I ATE THE CAKE");
      String str2 = "JAVA IS A PROGRAMMING LANGUAGE";
      logger.log(Level.INFO, "String {0}.isSimilar({1}) == {2}", new Object[]{
        str1.toString(), str2, str1.isSimilar(str2)});
    }
  }

  @Override
  public String toString(){
    return this.str;
  }

  private Map <Character, Integer> generateCharMap(String str){
    Map <Character, Integer> map = new HashMap<>();
    Integer currentChar;
    for(char c: str.toCharArray()){
      currentChar = map.get(c);
      if(currentChar == null){
        map.put(c, 1);
      } else {
        map.put(c, currentChar+1);
      }
    }
    return map;
  }

  public boolean isSimilar(String compareStr){
    Map <Character, Integer> compareStrMap = this.generateCharMap(compareStr);
    Set <Character> charSet = compareStrMap.keySet();
    int similarChars = 0;
    int totalStrChars = this.str.length();
    float thisThreshold;

    // Normalize by the longer string: compare lengths, not the number of distinct characters.
    if(totalStrChars < compareStr.length()){
      totalStrChars = compareStr.length();
    }

    Iterator<Character> it = charSet.iterator();
    char currentChar;
    Integer currentCountStrMap;
    Integer currentCountCompareStrMap;
    while(it.hasNext()){
      currentChar = it.next();
      currentCountStrMap = strMap.get(currentChar);
      if(currentCountStrMap != null){
        // A shared character counts as many times as it occurs in both strings.
        currentCountCompareStrMap = compareStrMap.get(currentChar);
        similarChars += Math.min(currentCountStrMap, currentCountCompareStrMap);
      }
    }

    thisThreshold = ((float) similarChars)/((float) totalStrChars);
    logger.log(Level.INFO, "similarChars: {0}, totalStrChars: {1}, thisThreshold: {2}", new Object[]{similarChars, totalStrChars, thisThreshold});
    return thisThreshold > THRESHOLD;
  }
}

I think you will want to define "similar" before you attempt to define an isSimilar method, though.

searchengine27
2

There are many ways to determine the similarity of two strings. One of the most common is edit distance, of which the Levenshtein distance is an example (there are several variations and other approaches; take a glance at Category:String similarity measures on Wikipedia).

The Levenshtein distance calculates the number of changes that would be necessary to change one string into another. For example:

JAVA IS A PROGRAMMING LANGUAGE
JAVA IS A PROGRAMMING LANGUAG X

has an edit distance of two: 'E' is changed to ' ' and 'X' is inserted.

kitten
sitting

has an edit distance of 3: 'k' to 's', 'e' to 'i', and 'g' inserted.

The function you would be looking at writing would then likely have a prototype of boolean similar(int threshold, String foo, String bar), where threshold is the maximum number of changes allowed and foo and bar are the two strings you are comparing.
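As a minimal sketch of that prototype, assuming the plain dynamic-programming form of the Levenshtein distance: the class name EditDistanceSketch and the helper levenshtein are hypothetical names chosen for illustration, and only the similar signature comes from the text above.

```java
// Hypothetical helper class; only the similar(...) signature is taken from the answer.
final class EditDistanceSketch {

    // Levenshtein distance via dynamic programming, keeping two rows of the table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when foo can be turned into bar with at most `threshold` single-character edits.
    static boolean similar(int threshold, String foo, String bar) {
        return levenshtein(foo, bar) <= threshold;
    }
}
```

With a threshold of 2, the first pair of strings from the question would count as similar, while the second pair would not, because the two strings differ in length by 16 characters and so need at least 16 edits.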

If you are doing lots of comparisons against a single string, you might look into constructing a Levenshtein automaton, which is a special type of finite automaton that accepts a string if it is within some edit distance of the string the automaton was built from.

1

There's no built-in function that does this in Java, so you're going to have to build one yourself. How you solve this depends entirely on how similar you require the strings to be.

The approach I would take is to use the split function Java provides so you can iterate through each word in the sentence. Then simply compare the characters of each word with those of each word in the other sentence.

Create some sort of ratio to decide pass or fail; as I said above, this depends on how similar you want the strings to be.
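One possible reading of this approach, as a rough sketch only: split both sentences on whitespace, count the words of the first sentence that also appear (exactly) in the second, and divide by the word count of the longer sentence. Exact word matching is an assumption made to keep the example short; the class name WordOverlapSketch, the method names, and the minRatio parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class illustrating a word-level pass/fail ratio.
final class WordOverlapSketch {

    // Fraction of words shared by both sentences, relative to the longer sentence.
    static double wordOverlap(String first, String second) {
        String[] firstWords = first.trim().split("\\s+");
        String[] secondWords = second.trim().split("\\s+");
        // Mutable copy so that each word in the second sentence is matched at most once.
        List<String> remaining = new ArrayList<>(Arrays.asList(secondWords));
        int matched = 0;
        for (String word : firstWords) {
            if (remaining.remove(word)) { // removes the first equal word, if any
                matched++;
            }
        }
        int longer = Math.max(firstWords.length, secondWords.length);
        return (double) matched / longer;
    }

    // Pass when the overlap ratio reaches the caller's minimum (an assumed parameter).
    static boolean isSimilar(String first, String second, double minRatio) {
        return wordOverlap(first, second) >= minRatio;
    }
}
```

For the first pair of strings in the question, four of the six words match, so a minRatio of 0.6 would pass; the second pair shares no words and would fail at any positive ratio. Comparing words character by character, as the answer suggests, would be more forgiving of typos than the exact matching used here.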

Browse through Here if you don't understand how to work with Strings in java.

Edit: There are also different algorithms you might be interested in, mentioned in another thread; here is a more specific example of one in use.

Community
jthort