
I'm comparing song titles, mostly in Latin script (though not always). My aim is an algorithm that gives a high score if two song titles appear to be the same title, and a very low score if they have nothing in common.

I already had code (Java) for this using Lucene and a RAMDirectory; however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics, which has many nice algorithms for comparing two strings:

https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics

BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex

but I'm not well versed in these algorithms. What would be a good choice?

I think Lucene uses CosineSimilarity in some form, so that is my starting point, but I suspect there might be something better.

Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well, although I expect those would need a different algorithm because of the way they treat spaces.

Paul Taylor
  • Simplistic: split strings on spaces into arrays of words, then do an array intersection. Intersections with higher counts mean more words in common. – Marc B Nov 28 '14 at 15:57
  • A Lucene BooleanQuery consisting of TermQueries would score each document according to the number of matching words, that sounds a lot like what you're looking for. – Marko Topolnik Nov 28 '14 at 16:10
  • @MarkoTopolnik yes, but I'm just trying to compare two strings; processing with Lucene means creating two documents, a RAMDirectory, analysers, and it's too slow. I should be able to get a good match without using Lucene – Paul Taylor Nov 28 '14 at 16:18
  • @MarcB but that doesn't consider spelling variations and misspellings. I'm looking for help choosing one of the similarity metrics already created – Paul Taylor Nov 28 '14 at 16:19
  • I thought you want to compare N titles, each with each. Of course it doesn't pay off if your unit of work is just comparing *two* titles. In that case a stupid O(n^2) algo, but with low constant factor, would be a far better option. – Marko Topolnik Nov 28 '14 at 16:22
  • I have code which finds good suggestions based on user entry, that's very close to your needs. I just split on space and compare each word with each other word using Jaro-Winkler distance. – Marko Topolnik Nov 28 '14 at 16:25
  • Try comparing based on Levenshtein distance. This is covered [in this question](http://stackoverflow.com/questions/6087281/similarity-score-levenshtein). – mindas Nov 28 '14 at 17:41
  • Apache Commons [StringUtils.getLevenshteinDistance](http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#getLevenshteinDistance(java.lang.CharSequence,%20java.lang.CharSequence)) – femtoRgon Nov 28 '14 at 18:01
  • @MarkoTopolnik thanks, that is the kind of answer I was looking for. Okay, I can use Jaro-Winkler, but I was hoping I could just use it with the complete string. Is it really necessary to split into separate words, and if so, how do you compare each word with every other word efficiently, and how do you calculate a final similarity score? – Paul Taylor Nov 29 '14 at 11:28
  • @mindas yes, Levenshtein is the one algorithm I do understand, but I was under the impression there are now better algorithms – Paul Taylor Nov 29 '14 at 11:29
  • I think you want your algo to be insensitive to word reorderings or missing words. I don't see how you imagine that being doable without splitting into words. As for my solution, it's not that simple in detail: I prepare a similarity score for each distinct word pair (which is *based* on Jaro-Winkler, but applies a "high contrast" function to its output, such that a JW score of 0.87 is transformed to 0.0 and 1.0 to 1.0), then collect the best matches, sum the scores, and normalize. I also have special treatment for common words, devaluing (but not completely ignoring) matches on them. A simplified sketch of this approach appears after this comment thread. – Marko Topolnik Nov 29 '14 at 12:21
  • Normalization is done such that I give high score to titles which match on any salient (non-common) word, and only slightly improve that score for further matching words. This is specific to my problem domain, you would probably want to do that aspect differently. – Marko Topolnik Nov 29 '14 at 12:23
  • @MarkoTopolnik no, I don't want it to be insensitive to word reorderings; what I would hope for is an algorithm that treats spaces between words a bit specially. – Paul Taylor Nov 29 '14 at 19:19
  • And what does that "bit" amount to? I'm not exactly following you. – Marko Topolnik Nov 29 '14 at 19:28
  • @MarkoTopolnik okay, if I was comparing 'hello world' to 'hallo world', that should be a closer match than 'hello world' compared to 'he lo world', because in the first example both have two words with a slight spelling difference, whereas in the second case we have two words compared to three. So I would hope there are algorithms that take the greater significance of spaces into account – Paul Taylor Nov 29 '14 at 19:59
  • e.g. consider an amended Levenshtein distance that calculates the number of changes/insertions/deletions to get from s1 to s2, but counts modification of a space twice so that it is more costly – Paul Taylor Nov 29 '14 at 20:45
  • "Hell oworld" would be four times farther from "Hello world" than "Hallo world" because one space is added and another removed. What you are after instead seems to be simple word counting. You want similiarity between words at the same position, which would be easily achieved by splitting. – Marko Topolnik Nov 29 '14 at 21:31
  • Generally true, but sometimes there would be extra words in the middle. It just seems that what I'm requesting (comparison of sentences, not just individual words) is not an obscure request, and I'm looking for an authoritative reference on it. – Paul Taylor Nov 29 '14 at 22:49
  • Extra words in the middle would pull any full-string-matching similarity score to near zero, due to all the single-character insertions those words bring in. I don't think you would be satisfied at all with that kind of result. Comparison of sentences is done via tokenization into words, that's why what you're asking for is not going to yield any positive answers. – Marko Topolnik Dec 01 '14 at 13:56
  • But hasn't anybody combined these two steps to make an algorithm that properly considers sentences? Lucene does in fact do this, but it doesn't make sense to use it when just comparing two sentences – Paul Taylor Dec 01 '14 at 16:25
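
For reference, here is a simplified sketch of the word-splitting approach described in the comments above. This is not Marko's exact code (his version also applies a "high contrast" function and devalues common words); it just splits both titles on whitespace, finds the best Jaro-Winkler match for each word, and normalizes by the longer word count so that extra words pull the score down. It assumes the simmetrics 1.x API, where every metric exposes getSimilarity(String, String):

```java
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public class TitleSimilarity {

    private static final JaroWinkler JARO_WINKLER = new JaroWinkler();

    /**
     * Splits both titles into words, finds the best Jaro-Winkler match for
     * each word of the shorter title, and normalizes by the longer word
     * count so that extra or missing words lower the score.
     */
    public static float score(String title1, String title2) {
        String[] words1 = title1.toLowerCase().split("\\s+");
        String[] words2 = title2.toLowerCase().split("\\s+");
        String[] shorter = words1.length <= words2.length ? words1 : words2;
        String[] longer  = words1.length <= words2.length ? words2 : words1;

        float total = 0f;
        for (String w1 : shorter) {
            float best = 0f;
            for (String w2 : longer) {
                best = Math.max(best, JARO_WINKLER.getSimilarity(w1, w2));
            }
            total += best;
        }
        return total / longer.length;
    }
}
```

With this, "hello world" vs "hallo world" scores high because both words find near-perfect matches, while "hello world" vs "he lo world" is pulled down by the extra token.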

5 Answers


They're all good. They work on different properties of strings and have different matching characteristics. Which works best for you depends on what you need.

I'm using JaccardSimilarity to match names. I chose JaccardSimilarity because it was reasonably fast and, for short strings, excelled at matching names with common typos while quickly degrading the score for anything else. It gives extra weight to spaces. It is also insensitive to word order. I needed this behavior because the impact of a false positive was much higher than that of a false negative, spaces could be typos (but not often), and word order was not that important.

Note that this was done in combination with a simplifier that removes diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizer that standardizes all word-separator symbols to a single space. Finally, the names are parsed to pick out initials, prefixes, inner names, and suffixes. This is because names have a structure and format that is rather resistant to plain string comparison.
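
A rough sketch of that pipeline (minus the name-parsing step, which is domain specific) might look like the following, assuming the simmetrics 1.x JaccardSimilarity and its getSimilarity(String, String) method; java.text.Normalizer does the diacritic stripping:

```java
import java.text.Normalizer;

import uk.ac.shef.wit.simmetrics.similaritymetrics.JaccardSimilarity;

public class NameMatcher {

    private static final JaccardSimilarity JACCARD = new JaccardSimilarity();

    /** Strips diacritics, lower-cases, and collapses separators to single spaces. */
    static String normalize(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{M}", "")      // drop combining diacritical marks
                .toLowerCase()
                .replaceAll("[^a-z0-9]+", " ") // map separators and punctuation to spaces
                .trim();
    }

    public static float score(String name1, String name2) {
        return JACCARD.getSimilarity(normalize(name1), normalize(name2));
    }
}
```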

To make your choice, you need to make a list of the criteria you want and then look for an algorithm that satisfies those criteria. You can also make a reasonably large test set and run all the algorithms on it to see what the trade-offs are with respect to time, true positives, false positives, false negatives and true negatives, the classes of errors your system should handle, etc.

If you are still unsure of your choice, you can also set up your system to switch the exact comparison algorithm at run time. This allows you to do an A/B test and see which algorithm works best in practice.
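
A minimal sketch of that run-time switch, assuming the simmetrics 1.x API where all metrics implement InterfaceStringMetric:

```java
import uk.ac.shef.wit.simmetrics.similaritymetrics.InterfaceStringMetric;
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaccardSimilarity;
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public class SwitchableMatcher {

    // Default choice; can be replaced at run time for an A/B test
    private volatile InterfaceStringMetric metric = new JaccardSimilarity();

    public void setMetric(InterfaceStringMetric metric) {
        this.metric = metric;
    }

    public float score(String a, String b) {
        return metric.getSimilarity(a, b);
    }

    public static void main(String[] args) {
        SwitchableMatcher matcher = new SwitchableMatcher();
        System.out.println(matcher.score("hello world", "hallo world"));
        matcher.setMetric(new JaroWinkler()); // swap the algorithm on the fly
        System.out.println(matcher.score("hello world", "hallo world"));
    }
}
```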

TL;DR: which algorithm you want depends on what you need; if you don't know what you need, make sure you can change it later and run tests on the fly.

M.P. Korstanje
  • I think I've been fairly clear about what I need; my problem is that it's unclear what these different algorithms are good for. But good advice, and your specific advice about JaccardSimilarity sounds promising. – Paul Taylor Dec 05 '14 at 16:25
  • Also have a look at the org.simmetrics.SimonWhite variant. It looks like it is an implementation of Sørensen's metric, which is related to Jaccard. – M.P. Korstanje Dec 06 '14 at 13:18
  • @PaulTaylor I'll have to make a few corrections. In my own project I'm using JaccardSimilarity; however, I'm using a 2-gram tokenizer, not the whitespace tokenizer that simmetrics uses. In simmetrics this is the DiceSimilarity metric, which uses the 2-gram tokenizer. Dice similarity is also known as Sørensen's metric. The SimonWhite variant is similar to both but uses a combination of whitespace and 2-gram tokenizers and uses the array intersection rather than the set intersection. The effects on the ranking are rather subtle and too complex to describe in a comment. – M.P. Korstanje Dec 06 '14 at 16:49
  • I haven't implemented a solution based on your advice yet, but it seems like good advice and directly answers the question, so I'm offering you the bounty – Paul Taylor Dec 08 '14 at 10:05
  • Cheers. I had some fun with the simmetrics library; I'm happy you pointed me in its direction. I'm currently doing a refactoring so I can merge some of the stuff I developed on my own into it. It is a nice library, but it hasn't been maintained for the last two years and it shows. Check out https://github.com/mpkorstanje/simmetrics/blob/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics/SimonWhite.java or Dice.java with a TokeniserWordQGram.java and TokeniserQGram2.java; it will save you some time. – M.P. Korstanje Dec 08 '14 at 10:56
  • Great, I will look at this properly in the New Year. But I notice with my current usage of CosineSimilarity that sometimes the tokenizer got stuck at uk.ac.shef.wit.simmetrics.tokenisers.TokeniserWhitespace.tokenizeToArrayList(TokeniserWhitespace.java:133), even though both sentences were less than 100 chars long. I see from your GitHub refactoring that you mention tokenizer caching, and that sounds very useful, as I am indeed typically comparing the same sentence to about 10 sentences. – Paul Taylor Dec 21 '14 at 11:40
  • If you are using the 1.6.3 release, I'm not surprised; it is buggy as hell. I've fixed the source code for that, but there hasn't been a new release yet. You may want to build it yourself. Either way, can you post the exception and input strings on GitHub? – M.P. Korstanje Dec 21 '14 at 15:11
  • The problem only occurs for one customer; there was no exception, it just never finished at the line I posted above. I have now deployed a new version of the application using a version of simmetrics built from your master fork on GitHub; if the problem reoccurs I'll gather that information and post it on GitHub – Paul Taylor Dec 21 '14 at 15:45
  • Yeah, the error no longer occurs when built against your version of 1.6.3 :) – Paul Taylor Jan 05 '15 at 11:46

You likely need to solve a string-to-string correction problem. The Levenshtein distance algorithm is implemented in many languages. Before running it, I'd remove all spaces from the strings, because they don't carry much information but may influence the measured difference between two strings. For string search, prefix trees are also useful; you can have a look in this direction as well, for example here or here. This was already discussed on SO. If spaces are that significant in your case, just assign a greater weight to them.
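
For illustration, here is a sketch of such a weighted Levenshtein, where any edit involving a space costs double. The SPACE_COST constant is an arbitrary tuning parameter for this example, not something taken from a library:

```java
/**
 * Levenshtein distance where edits involving a space cost extra, so that
 * changes to word boundaries are penalized more heavily than ordinary
 * character edits.
 */
public final class SpaceWeightedLevenshtein {

    private static final int SPACE_COST = 2; // cost of an edit involving a space

    public static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) {
            d[i][0] = d[i - 1][0] + cost(s.charAt(i - 1)); // delete all of s
        }
        for (int j = 1; j <= t.length(); j++) {
            d[0][j] = d[0][j - 1] + cost(t.charAt(j - 1)); // insert all of t
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                char a = s.charAt(i - 1), b = t.charAt(j - 1);
                int substitution = (a == b) ? 0 : Math.max(cost(a), cost(b));
                d[i][j] = Math.min(
                        d[i - 1][j - 1] + substitution,   // substitute a with b
                        Math.min(d[i - 1][j] + cost(a),   // delete a
                                 d[i][j - 1] + cost(b))); // insert b
            }
        }
        return d[s.length()][t.length()];
    }

    private static int cost(char c) {
        return c == ' ' ? SPACE_COST : 1;
    }
}
```

With this weighting, distance("hello world", "hallo world") is 1 while distance("hello world", "he lo world") is 2, which is the behaviour asked for in the comments on the question.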

Mikhail
  • So is there not a name for the better algorithm you describe, rather than just "use Levenshtein but make it better"? I already know about Levenshtein, but there are a whole lot of other algorithms linked in the question; I will add them to the question to make it clearer. – Paul Taylor Dec 03 '14 at 12:32
  • The best algorithm depends on your concrete domain. I'd start from one of them and then modify it to solve my problem. Levenshtein looks like a good candidate. – Mikhail Dec 05 '14 at 08:04

Each algorithm is going to focus on a similar, but slightly different aspect of the two strings. Honestly, it depends entirely on what you are trying to accomplish. You say that the algorithm needs to understand words, but should it also understand interactions between those words? If not, you can just break up each string according to spaces, and compare each word in the first string to each word in the second. If they share a word, the commonality factor of the two strings would need to increase.
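
A minimal sketch of that word-overlap idea; dividing the shared-word count by the union of words makes it effectively Jaccard similarity over word sets:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordOverlap {

    /** Fraction of distinct words the two strings share, from 0.0 to 1.0. */
    public static double commonality(String s1, String s2) {
        Set<String> words1 = new HashSet<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> words2 = new HashSet<>(Arrays.asList(s2.toLowerCase().split("\\s+")));

        Set<String> shared = new HashSet<>(words1);
        shared.retainAll(words2);             // words present in both strings

        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);                 // words present in either string

        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }
}
```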

In this way, you could create your own algorithm that focused only on what you were concerned with. If you want to test another algorithm that someone else made, you can find examples online and run your data through to see how accurate the estimated commonality is with each.

I think http://jtmt.sourceforge.net/ would be a good place to start.

Jase Pellerin

Interesting. Have you thought about a radix sort?

http://en.wikipedia.org/wiki/Radix_sort

The concept behind radix sort is that it is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping the keys by their individual digits. If you convert each string into an array of characters, each character code is a number of no more than 3 digits, so k = 3 (the maximum number of digits) and n = the number of strings to compare. This will sort on the first digits of all your strings. Then there is another factor, s = the length of the longest string. Your worst case for sorting would be 3*n*s and the best case (3 + n) * s. Check out some radix sort examples for strings here:

http://algs4.cs.princeton.edu/51radix/LSD.java.html

http://users.cis.fiu.edu/~weiss/dsaajava3/code/RadixSort.java
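
For reference, here is a condensed version of the LSD string sort along the lines of the first link; it assumes all strings have the same length w and characters from the extended-ASCII range:

```java
public class LsdRadixSort {

    private static final int R = 256; // extended-ASCII alphabet size

    /** Sorts an array of strings, each exactly w characters long. */
    public static void sort(String[] a, int w) {
        String[] aux = new String[a.length];
        // Key-indexed counting on each character position, right to left
        for (int d = w - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (String s : a) count[s.charAt(d) + 1]++;          // frequency counts
            for (int r = 0; r < R; r++) count[r + 1] += count[r]; // cumulative counts
            for (String s : a) aux[count[s.charAt(d)]++] = s;     // distribute
            System.arraycopy(aux, 0, a, 0, a.length);             // copy back
        }
    }
}
```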

applecrusher

Did you take a look at the Levenshtein distance?

int org.apache.commons.lang.StringUtils.getLevenshteinDistance(String s, String t)

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

The previous implementation of the Levenshtein distance algorithm was from http://www.merriampark.com/ld.htm

Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings. This implementation of the Levenshtein distance algorithm is from http://www.merriampark.com/ldjava.htm
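
A quick usage sketch; note that plain Levenshtein treats a space like any other character, so both pairs below are a single edit apart, which is exactly the behaviour the question wants to weight differently:

```java
import org.apache.commons.lang.StringUtils;

public class LevenshteinDemo {

    public static void main(String[] args) {
        // Both pairs differ by one single-character edit
        System.out.println(StringUtils.getLevenshteinDistance("hello world", "hallo world")); // 1
        System.out.println(StringUtils.getLevenshteinDistance("hello world", "he lo world")); // 1

        // One way to turn the raw distance into a 0..1 similarity score
        String a = "hello world", b = "hallo world";
        int d = StringUtils.getLevenshteinDistance(a, b);
        double similarity = 1.0 - (double) d / Math.max(a.length(), b.length());
        System.out.println(similarity); // ~0.91
    }
}
```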

Anyway, I'm curious to know what you choose in this case!

Donatello