Identify strings with the same meaning in Java

Question

I have the following problem. I want to identify strings in Java that have a similar meaning. I tried calculating similarities between strings with Stringmetrics. This works as expected, but I need something more convenient.

For example, take the following two strings (one word each):

String s1 = "apple";
String s2 = "appel";

These two strings are very similar. When I use the cosine similarity, I get the following result:

double score = cosine.compare(s1, s2); // 0.0

But when I use the Damerau-Levenshtein similarity, I get the following result:

double score = damerauLevenshtein.compare(s1, s2); // 0.8

The next problem is that there are a lot of synonyms for words. With Stringmetrics these synonyms are not considered.

For example these 2 strings should be considered the same:

String s3 = "purchase 10 bottles of water";
String s4 = "buy 10 waterbottles";

I hope you guys can help me.

I don't think you grasp the complexity of stuff like this ;) — Tim, Apr 26 '17 at 13:34
Oh it's simple. You just need 5 trillion `if` statements. Feel free to post the completed code to Code Review. — Michael, Apr 26 '17 at 13:35
You need a list of all synonyms. There is nothing you can do without the linguistic knowledge... — Obenland, Apr 26 '17 at 13:35
You could invest 20 years of research and still not come up with a solution that covers your requirements. This is very complicated and far too complex for a SO question. — f1sh, Apr 26 '17 at 13:35
I would be amazed if a computer could do this relatively easily (unless you are Google or Watson). — Steve Smith, Apr 26 '17 at 13:36
Far too complex. You can't get such a result even with natural language processing. — Sagar Gautam, Apr 26 '17 at 13:38
Well, I know that this isn't an easy task; that's why I asked you guys. So what kind of similarity analysis is possible these days? Is using string metrics the only approach to this? What about NLP? Could I achieve better results with it? — sstoeferle, Apr 26 '17 at 14:08

score 0 · Accepted Answer · answered Apr 26 '17 at 14:14

Levenshtein distance (edit distance) works like the autocorrect on your phone. Take your example, apple vs. appel. The distance counts how many single-letter insertions, deletions or replacements turn one word into the other. Plain Levenshtein needs two edits here (replace e with l and l with e). Damerau-Levenshtein additionally counts swapping two adjacent letters as a single edit, so it needs only one, which is consistent with your score of 0.8 (one edit over five letters). Under plain Levenshtein, words like applr or appee are closer to apple than appel is, because each of them needs only a single replacement.
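To make the edit counting concrete, here is a minimal sketch, not the Stringmetrics implementation. The class and method names are made up for illustration, the transposition rule is the simple "optimal string alignment" variant of Damerau-Levenshtein, and normalising by the longer string's length is an assumption that happens to reproduce the 0.8 from the question.

```java
final class EditDistanceSketch {

    // Counts insertions, deletions and replacements; if allowTransposition is
    // true, a swap of two adjacent characters also counts as one edit.
    static int distance(String a, String b, boolean allowTransposition) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (allowTransposition && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // Assumed normalisation: 1 - distance / length of the longer string.
    static double similarity(String a, String b, boolean allowTransposition) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) return 1.0; // two empty strings are identical
        return 1.0 - (double) distance(a, b, allowTransposition) / longest;
    }
}
```

With this sketch, distance("apple", "appel", false) is 2 and distance("apple", "appel", true) is 1, while applr and appee are one edit away from apple either way.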

Cosine similarity is completely different: it splits each string into words, builds a vector of the word counts and measures how similar those vectors are. Here you have two different single words, so the vectors have nothing in common and the result is 0.
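A rough sketch of that word-count idea, assuming whitespace tokenisation and lower-casing (the library's actual tokeniser may differ, and the class name is hypothetical):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSketch {

    // Builds a word -> count map; splitting on whitespace is an assumption.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean length of the count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0;
        for (int v : counts.values()) sum += (double) v * v;
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two count vectors; 0 if either is empty.
    static double cosine(String a, String b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(ca);
        double nb = norm(cb);
        return (na == 0 || nb == 0) ? 0.0 : dot / (na * nb);
    }
}
```

Since apple and appel are different tokens, cosine("apple", "appel") comes out as 0.0. The two sentences from the question share only the token 10, so their score is low as well.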

What you want is a combination of those two techniques, plus a computer with knowledge of the language, plus a synonym dictionary that is consulted before and after running those similarity algorithms. Imagine taking a sentence and replacing every single word with a synonym (who remembers Joey and the thesaurus?): the result could be a completely different sentence. On top of that, every word can have multiple synonyms, and some of those synonyms only fit in a specific context. In the general case your task is simply impossible as of now; maybe in the future.
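Just to show what the dictionary step would look like, here is a toy sketch (Java 9+ for Map.of and Set.of). Everything in it is hypothetical: the class name, the three hand-written dictionary entries and the filler-word list exist only to make your two example sentences match. It ignores context entirely, which is exactly why this does not scale to real language.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SynonymSketch {

    // Hypothetical, hand-written dictionary: word -> canonical word(s).
    static final Map<String, List<String>> CANONICAL = Map.of(
            "purchase", List.of("buy"),
            "waterbottles", List.of("water", "bottles"),
            "waterbottle", List.of("water", "bottles"));

    // Hypothetical list of filler words that are simply dropped.
    static final Set<String> FILLER = Set.of("of", "a", "the");

    // Replaces each word by its canonical form and returns the set of words.
    static Set<String> normalize(String text) {
        Set<String> words = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty() || FILLER.contains(token)) continue;
            words.addAll(CANONICAL.getOrDefault(token, List.of(token)));
        }
        return words;
    }

    // True only if both strings reduce to the same set of canonical words.
    static boolean sameAfterNormalization(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
```

With these entries, "purchase 10 bottles of water" and "buy 10 waterbottles" both reduce to the words 10, bottles, buy and water, so they compare as equal. Any word missing from the dictionary, or any synonym that only holds in some contexts, breaks it immediately.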

P.S. If your task were possible, I think translation software would be basically perfect, but I'm not really sure about that.
