Compare array contents accuracy algorithm

Question

I have voiceResults in an array to search for a contact:

Ben McDonald
Ben MacDonald
Ken McDonald
Ken MacDonald

I have established the potentialMatches (example) in another array:

Ben McDonald
Benjamin MacDonald
Donna McBlead //anagram
Ben Mad
abcdelmno //occurrences in alphabetical order
onmledcba //occurrences in reverse alphabetical order
completely Random
a cannon
BK Lounge

My goal is to establish which is the most likely contact the user wanted to view.

When looping through the two arrays, I want to use similar logic to the game Mastermind, where I can see if a letter is 'right but in the wrong place' or 'right and in the right place'. I can then compare it to the .length() of the element and get a floating percentage of letter matches and exact position matches.

To do the above, I not only need to loop between elements of the arrays, I then need to break the elements up by their letters and compare those individual element letters. To establish the Mastermind/anagram logic, I would need to remove letters that are matches until I'm left with the letters that don't match and again compare that amount to the original length to get a percentage.

Looking at the example array data above, I also need to perform this in reverse and spit first names and last names.

For each array I started with the following:

    ArrayList<String> voiceResults = new ArrayList<String>();
    ListIterator<String> itr = voiceResults .listIterator();
    Arrays.asList(voiceResults.toArray());

    while (itr.hasNext()) {
    sid = itr.nextIndex();
    element = itr.next();

    sidpass = sid.toString();
    rawpass = element.toString().toLowerCase();
    rawpass.trim();

    hcs = rawpass.split("\\s");
    hnc = hcs.length;

    if (hnc == 2) {
    fn = hcs[0]; //first name
    ln = hcs[1]; //last name
    fn = fn.replaceAll("[^a-z]", ""); //remove punctuation
    ln = ln.replaceAll("[^a-z]", "");

    }

I post the above, but I'm certain it's not the correct starting method.

Reading through many examples of anagram checks and algorithms, they vary hugely and use for and while loops, hashmaps, hash-tables, histograms, floating values etc.

I hold my hands up, I'm at a total loss of which is the best/fastest/most practical method to initially perform these loops, inner loops, inner element loops...

It would be greatly appreciated if I could have some suggestions of how the loops should be structured to begin with.

Further suggestions/examples/links of the letter comparisons and reverse iteration would be fantastic. Hopefully I can piece everything together then.

Finally, how should I store these percentages that relate to the elements??

I thank you in advance.

PLEASE NOTE: Although the example data may suggest otherwise, I have already used a loop and .contains() .matches() etc etc.

You will be searching this list of names present in DB right? — Sashi Kant, Jun 08 '12 at 12:56
Initially .contains() or .matches() using cur.getString(cur .getColumnIndex(ContactsContract.Contacts.DISPLAY_NAME)); It would ideally take place during this first loop, but, time consuming or unnecessary if an exact match could be initially found. — brandall, Jun 08 '12 at 13:09

score 1 · Accepted Answer · edited May 23 '17 at 12:34

1

There are many different spelling algorithms but in the past I have used Levenshtein or Soundex (each have their pros and cons). Soundex might work better for you since your getting this from vocal.

You might also want to check out:

Getting the closest string match

and

What algorithm gives suggestions in a spell checker?

edited May 23 '17 at 12:34

Community

1
1

answered Jun 08 '12 at 12:38

Adam Gent

47,843
23
153
203

Fantastic, thank you so much - I can't believe in my hours of searching I hadn't come across that stackoverflow post... – brandall Jun 08 '12 at 12:47

Yilmaz Guleryuz · Answer 2 · 2012-06-10T19:44:02.770

1

You can also use this library; http://code.google.com/p/string-similarity/

implementation is clean, easy to customize per your need.

for example lets use JaroStrategy for string comparison

    double similarity = 0.0;
    // Calculates the similarity score of objects, whereas 
    // 0.0 implies absolutely no similarity 
    // 1.0 implies absolute similarity.   
    SimilarityStrategy strategy = new JaroStrategy();
    StringSimilarityService service = new StringSimilarityServiceImpl(strategy);

similarity = service.score("Ben McDonald", "Ken MacDonald");

edited Jun 10 '12 at 19:44

answered Jun 08 '12 at 12:59

Yilmaz Guleryuz

9,313
3
32
43

That's great! I can't see the usage documentation, but I'll have a search around for it. Thank you. – brandall Jun 08 '12 at 13:13
1

well, glad to hear that. just added sample usage code. feel free to accept the answer if you think this was helpful – Yilmaz Guleryuz Jun 10 '12 at 19:45
Thanks again. I found the usage documentation and indeed it provides a very accurate 'score'. Problem is, that when looping through my 1,000 contacts to test, it is painfully slow... I'm accessing the contacts through the standard cur.getString(cur .getColumnIndex(ContactsContract.Contacts.DISPLAY_NAME‌)); whilst iterating through the voice data as demonstrated in the OP - If you can suggest a way this can be done efficiently/fast(!), you will have my 100% accepted vote! Thanks in advance... – brandall Jun 10 '12 at 19:54
1

improving contacts loading is a separate topic, and I'm sure answers exists in stackoverflow already :) regarding faster comparison; this library is pretty fast in iterations, on top of that I suggest you to do this processing inside AsyncTask to avoid blocking UI. – Yilmaz Guleryuz Jun 10 '12 at 21:03

Compare array contents accuracy algorithm

2 Answers2