fast algorithm to compare a list of string for similarity

Question

i was given a list that contains over 90,000 names. i am to check the names that have >= 50% similarity, and write the result to a file in the format:

ID 1, ID 2, Similarity percent.

i already have an algorithm that checks the similarity, but iterating through the whole list takes alot of time. Can someone help out with a fast algorithm to compare the names?

below is the code

public static void main(String[] args) throws IOException {


    List<String> list = new ArrayList<>();
    int count = 0;
    FileWriter f = new FileWriter(new File("output.txt"));
    StringBuilder str = new StringBuilder();
    Scanner scanner = new Scanner(new File("name.csv"));

    while (scanner.hasNextLine()) {


        count++;
        list.add(scanner.nextLine());

    }


    long start = System.currentTimeMillis();

    //////////////////////////////////////////////////////////
    for (int i = 0; i < list.size(); i++) {

        for (int j = i + 1; j < list.size(); j++) {


            int percent = StringSimilarity.simi(list.get(i), list.get(j));
            if (percent >= 50) {

                str.append("ID " + i + ",ID " + j + "," + percent + " percent");
                str.append("\n");
            }
        }
    }
    ////////////////////////////////////////////////////////

    long end = System.currentTimeMillis();

    f.write(str.toString());

    System.out.println((end - start) / 1000 + " second(s)");

    f.close();
    scanner.close();

}

public static String getString(String s) {
    Pattern pattern = Pattern.compile("[^a-z A-Z]");
    Matcher matcher = pattern.matcher(s);
    String number = matcher.replaceAll("");
    return number;
}

This is a sample of how the data looks.....the names are stored in a . csv file, so I read the file and stored the names in the list.

FIRST NAME,SURNAME,OTHER NAME,MOTHER's MAIDEN NAME

Kingsley, eze, Ben, cici

Eze, Daniel, Ben, julie

Jon, Smith, kelly, Joe

Joseph, tan, , chellie

Joseph,tan,jese,chellie

....and so on A person can have 3 NAMEs at least.....like I stated earlier, the program is to check how similar the names are, so when comparing Id 1 and id 2, "ben" is common and "eze" is common, so they have a 50 percent similarity. Comparing id 4 and id 5, the similarity is 75percent....because they have 3 names in common even though id 4 doesn't have a 3rd name....

So the problem is...during the similarity check using the two for loops, I start with the 1st id and check it through the remaining 90,000 names and save the id's that it has >= 50 percent similarity with, then take the next id 2 and do same......and so on

There is the **soundex** algorithm that often is available on databases too, and creates a minimal "same sounding" letter group. — Joop Eggen, Sep 16 '16 at 17:38
Is *Joseph* similar to *Joe*? How about *Kady* and *Catie*? Are you only after similarly spelled names, or are you after similar sounding names, or are you after names likely to be aliases or nicknames of each other? — jxh, Sep 16 '16 at 17:42
Without knowing what "similar" means, it's impossible to answer your question. String similarity probably isn't the way to go. For example, string similarity will tell you that "Joe Smith" is more similar to "Jim Smith" than "James Smith" is, even though it's likely that "Jim Smith" and "James Smith" are the same person. — Jim Mischel, Sep 16 '16 at 18:12
By similarity I mean........Jon, Smith, Joe, kenny and Jon, Smith, king, kelly have a 50% similarity, because they have two names in common....if they have three names, then it's 75%, if the have four names, then it's 100%...I hope I answered your question correctly — kuebano, Sep 17 '16 at 10:33
@kuebano: I think you should update the question. The question is interesting, but easily misleading people that spend time for writing answers. And there are still open questions: What is the similarity of a pair with differing length, e. g. "Jon, Smith, kenny" and "Jon, Smith, king, kelly". Is the similarity symmetric? What is a name? Is "Jon" a name or "Jon, Smith, king, kelly"? This is confusing and leads to misunderstandings. — mm759, Sep 17 '16 at 13:58
How about giving us some sample data? Your example of "Jon, Smith, Joe, kenny" and "Jon, Smith, king, kelly" didn't really explain things. Are you trying to compare individual names to see if they're 50% similar to other individual names? Or are you comparing lists and writing the names that occur in both lists to a file? — Jim Mischel, Sep 17 '16 at 16:20
Does order matter? Would "Jon, Smith, kelly, joe" be the same as "joe, Jon, Smith, kelly"? — Jim Mischel, Sep 19 '16 at 18:21

score 2 · Answer 1 · answered Sep 16 '16 at 18:35

The simularity function is assumed to be optimal: if 6 from 11 letters are different, simply return already, say 0.

One slight improvement would be to not use StringBuilder and to skip already found matches. This is a bit critical, as it might be A ≈ B ∧ B ≈ C ∧ A ≉ C so that some matches are lost.

Charset charset = StandardCharsets.ISO_8859_1; // Better UTF_8

Path inputPath = Paths.get("names.txt");
List<String> list = Files.readAllLines(inputPath, charset);

Path outputPath = Paths.get("output.txt");
try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(path, charset))) {

    int n = list.size();
    for (int i = 0; i < n; ++i) {
        list.set(i, normalize(list.get(i)));
    }

    for (int i = 0; i < n; ++i) {
        String ithWord = list.get(i);
        for (int j = i + 1; j < n; ++j) {
            String jthWord = list.get(j);
            if (jthWord != null) {
                int perc = similarity(ithWord, list.get(j));
                if (similarity >= 50) {
                    out.printf("ID %d,ID %d,%d percent or greater%n", i, j, perc);
                    list.set(j, null); // Skip it for other i
                }
            }
        }
    }
 }

One could use parallelism of java 8:

final List<String> list = ...
IntStream.range(0, list.size())
    .parallelStream()
    .map(i -> ...
    ...

But that would change nothing of the quadratic complexity.

What would help would be to sort the list and from the ith word derive all prefixes that would be in the 90 % range. Unfortunate with 50 % that is not feasible (n over n/2).

I would ask for other requirements like sounds similar, have at most 3 typos or such. Or run it at night.

mm759 · Answer 2 · 2016-09-17T14:30:53.477

The following comment of the author of the question is important:

By similarity I mean........Jon, Smith, Joe, kenny and Jon, Smith, king, kelly have a 50% similarity, because they have two names in common....if they have three names, then it's 75%, if the have four names, then it's 100%

A map-based approach can be used as already suggested by Sakalya. I propose to use a HashMap with sets of names as values and sets of name-parts of keys. A mapping could be for example:

{"Jon", "Smith"} -> {"Jon, Smith, Joe, kenny", "Jon, Smith, king, kelly"}

The idea for populating the map is to take each name, create the set that contains all name-parts and create also all subsets of this set (excluding the empty set). If you have the name "Jon, Smith, Joe, kenny" the sets will be:

{"Jon"}, {"Smith"}, {"Joe"}, {"kenny"},
{"Jon", "Smith"}, {"Jon", "Joe"}, {"Jon", "kenny"}, {"Smith", "Joe"}, {"Smith", "kenny"}, {"Joe", "kenny"},
{"Jon", "Smith", "Joe"}, {"Jon", "Smith", "kenny"}, {"Jon", "Joe", "kenny"}, {"Smith", "Joe", "kenny"}
{"Jon", "Smith", "Joe", "kenny"}

The name has to be added as value-element to the map for each set as key. This has to be done for each name.

After populating the map, one has to iterate again over each name. One has to create the part-sets of the name again. The idea is to find other names that have a set in common. Only sets with a minimal size are relevant so that another name that shares the set has a similarity >= 50%. Finding these names can be done by querying the map for each relevant set.

If I don't miss anything the complexity (time as well as space) is linear concerning the number of names. The maximal number of parts of a name is assumed to be constant. The number of part-sets of a name with n parts is 2^n-1 (cp. "power set"):

1 part: 1
2 parts: 3
3 parts: 7
4 parts: 15
5 parts: 31
6 parts: 63
7 parts: 127

The space-requirements are higher than the ones of the algorithm belonging to the question, but I think they will still be no problem on an ordinary desktop-computer. Let's say each name has 20 sets (average) and each set needs 40 bytes. The needed space will be 90,000*20*40 = 72,000,000 bytes in this case. The space-requirements can be decreased by using the string-pool with String.intern().

I think the main task is to create the name-part-sets from the set that contains all parts of a name. This gives examples on how to do it: http://stackoverflow.com/questions/1670862/obtaining-a-powerset-of-a-set-in-java. — mm759, Sep 19 '16 at 14:10

score 0 · Answer 3 · answered Sep 16 '16 at 17:37

0

Your algorithm is O(n^2) for Similarity.The fastest way to do this,is just scan one list and keep the values of those list in a hashmap as Key values.And when you scan the second list then check if that element is already present in the hashmap.This will work faster.

answered Sep 16 '16 at 17:37

Sakalya

568
5
15

The problem is that the question is about similar names and not equal names. – mm759 Sep 16 '16 at 17:43
What is the second list? Which values do you store in hashmap? – serhiyb Sep 16 '16 at 17:43
The question says "i already have an algorithm that checks the similarity, but iterating through the whole list takes alot of time. Can someone help out with a fast algorithm to compare the names?" – Sakalya Sep 16 '16 at 17:48
The soundex code of a word could be used as map-key (cp. comment of Joop Eggen). This uses a relation as similarity relation that is an equivalence relation. – mm759 Sep 16 '16 at 17:48

score 0 · Answer 4 · edited May 23 '17 at 12:32

0

There are many String matching algorithms present and already there are lots of discussion done on SO.

Go through this links https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java

edited May 23 '17 at 12:32

Community

1
1

answered Sep 16 '16 at 17:48

Rishal

1,480
1
11
19

The challenge of this question is to find pairs with minimal similarity without having to compare each word with each other. – mm759 Sep 16 '16 at 17:53
@mm759 the algorithm discussed over there compares the minimum number of single character edit i.e deletion,insertion or substitution, So it optimised already but as far as requirement is concerned further optimisation can be done by going through the algorithm approach. – Rishal Sep 16 '16 at 18:00

score 0 · Answer 5 · answered Sep 16 '16 at 17:51

0

You could also parallelize this task quite easily IMHO. Just compute more than one similarity at same time. It will not improve time complexity of algorithm, but better than nothing. :-)

answered Sep 16 '16 at 17:51

Firzen

1,909
9
28
42

Parallelizing an n^2 algorithm doesn't usually help much when n gets very large. In his case he's talking on the order of 8 billion comparisons. Sure, you could use four cores and do it nearly four times as fast, but a better algorithm can likely make it 400 times as fast on a single core. – Jim Mischel Sep 16 '16 at 18:09
Yes, I completely agree. But as I said - in practise, parallelization is just better than noting. – Firzen Sep 16 '16 at 18:34

fast algorithm to compare a list of string for similarity

5 Answers5