I'm comparing song titles, mostly (but not always) in Latin script. My aim is an algorithm that gives a high score if two song titles appear to be the same title and a very low score if they have nothing in common.
I already had Java code that did this using Lucene and a RAMDirectory. However, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics, which has many nice algorithms for comparing two strings:
BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex
However, I'm not well versed in these algorithms. Which of them would be a good choice?
I think Lucene uses CosineSimilarity in some form, so that is my starting point, but I suspect there might be something better.
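To make that starting point concrete, here is a minimal sketch of what I understand by cosine similarity over words, written in plain Java. It is not Lucene's scoring and not the simmetrics implementation; the class name WordCosine and its methods are hypothetical names for illustration, and I'm assuming titles are split on whitespace and compared case-insensitively.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: cosine similarity between word-count vectors of two titles.
final class WordCosine {

    // Lower-case the title and count each whitespace-separated word.
    static Map<String, Integer> wordCounts(String title) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : title.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns a score in [0, 1]: dot product of the count vectors
    // divided by the product of their lengths.
    static double similarity(String a, String b) {
        Map<String, Integer> ca = wordCounts(a);
        Map<String, Integer> cb = wordCounts(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            return 0.0; // assumption: an empty title matches nothing
        }
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        double normA = 0.0;
        for (int v : ca.values()) {
            normA += v * v;
        }
        double normB = 0.0;
        for (int v : cb.values()) {
            normB += v * v;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(similarity("Let It Be", "let it be"));
        System.out.println(similarity("Let It Be", "Let It Bleed"));
        System.out.println(similarity("Let It Be", "Hey Jude"));
    }
}
```

A measure like this only rewards words that match exactly, so a one-letter typo in a word counts as a complete miss for that word, which is part of why I suspect something better exists.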
Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important. Good matching of other scripts such as Korean and Chinese is relevant as well, but I expect that would need a different algorithm because of the way those scripts treat spaces.
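For scripts where spaces don't reliably mark word boundaries, the kind of thing I imagine is a comparison on character pairs instead of words. Below is a rough sketch of a Dice coefficient over character bigrams in plain Java. Again, this is not taken from simmetrics; BigramDice and its methods are hypothetical names, and I'm assuming whitespace can simply be dropped before comparing.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: Dice coefficient over sets of character bigrams.
final class BigramDice {

    // Lower-case, drop whitespace, and collect every pair of adjacent code points.
    // Note: "\\s" in Java matches ASCII whitespace only by default.
    static Set<String> bigrams(String title) {
        int[] cps = title.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", "")
                .codePoints()
                .toArray();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            grams.add(new String(cps, i, 2));
        }
        return grams;
    }

    // Returns 2 * |A ∩ B| / (|A| + |B|), a score in [0, 1].
    static double similarity(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        if (ga.isEmpty() || gb.isEmpty()) {
            return 0.0; // assumption: titles shorter than two characters match nothing
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        System.out.println(similarity("강남스타일", "강남 스타일"));
        System.out.println(similarity("night", "nacht"));
    }
}
```

Working on code points rather than `char` values keeps characters outside the Basic Multilingual Plane intact, but ignoring spaces entirely throws away the word structure I want for Latin-script titles, so I assume the two cases would need to be handled separately.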