Sorting String based on similarities

Question

consider the following Strings:

he llo
goodbye
hello
= (goodbye)
(he)(llo)
good bye
helium

I'm trying to sort these in such a way that similar words comes together, I know

alphanumerical sorting is not an option
removing special chars ",-_ and etc then comparing is certainly helpful but results won't be as good as I hope for.

NOTE :

there might be few different desired ouput for this, one of which is :

DESIRED OUTPUT:

hello
he llo
(he)(llo)
helium
goodbye
good bye
= (goodbye)

so my question is that if there is a java package that compares strings and ultimately sort them based on it .

I've heard of terms such as n-gram and skip-gram but didn't quite understand them. I'm not even sure if they can be useful for me at all.

UPDATE: finding similarities is certainly part of my question but the main problem is the sorting part.

possible duplicate of [Similarity String Comparison in Java](http://stackoverflow.com/questions/955110/similarity-string-comparison-in-java) — dognose, Jul 13 '15 at 09:32
Maybe the area you are searching for is NLP, Natural Language Processing, as you mention `hello` (`helium`) and `goodbye` in conjunction. The soundex algorithm is established but won't help with spaces. — Joop Eggen, Jul 13 '15 at 09:36
@dognose thx for the link, I can see its very useful for comparison. but this approach limits the sorting. how can it be used for sorting? — nafas, Jul 13 '15 at 09:51
@nafas you can use a custom comparator for this. you just need to calculate the similarity index against a "certain" expression, and sort based on that value. For instance if you reference "foo bar", "foo baz" and "baz bar" should score high, while "hello world" should score low. maybe it would also make sence to identify "similiar looking" elements, arrange them in groups and then sort the groups alphabetic. — dognose, Jul 13 '15 at 09:58
@dognose it wouldn't work most of the time. for example assuming "foo bar" compare with "blah" is 0.1 and compare with "double" is also 0.1 but doesn't necessarily mean "blah" and "double" are similar. it can get very very complicated — nafas, Jul 13 '15 at 10:02
You want to sort but you don't know on what basis exactly. Unless you have a definitive basis for sorting, this is a moot point. Similarity grouping is actually what you probably need. For which, take a look at [FALCONN](https://github.com/FALCONN-LIB/FALCONN). It doesn't have transitive property, so it can't be used for sorting, nor any other similarity grouping algorithm that I know of. For simpler, but slower mechanisms, check out Levenshtein automatons. — Unknown, Aug 04 '23 at 11:10

Simon · Answer 1 · 2015-07-13T09:55:52.113

4

Here's one possible approach.

Calculate the edit distance/Levenshtein distance between each pair of strings and then you use view the strings as a complete graph where the edge weights come from the edit distance. Choose a threshold for those weights and remove all the weights that to high. Then find the cliques in this graph. If your threshold is fairly low perhaps even finding connected components would be an option.

Note: Perhaps it would be better to substitute some edit distance with one of the similarity measures in the link that @dognose posted. Also, note that finding cliques will be very slow if you have a large numbers of strings

edited Jul 13 '15 at 09:55

answered Jul 13 '15 at 09:39

Simon

6,293
2
28
34

I've used clique approach for some similar problem before, it certainly works. but as you mentioned it can be very slow. unfortunately for me I have around 10mil+ data. so clique would be out of option – nafas Jul 13 '15 at 09:47
How about just finding connected components? – Simon Jul 13 '15 at 09:56
problem can arise when we have A-B and B-C and A-D but not A-C and not B-D then how do we decide how to sort them? – nafas Jul 13 '15 at 10:04

Sorting String based on similarities

1 Answers1