1

I have a list of a bunch of phrases. Because this is a fairly long list, I also have a text box which users can type into as a search bar. As of right now, terms that do not exactly contain with the letters in the search bar are filtered out. However, I would like to have it give a list of a few suggestions of what the word might be.

Note: I am not looking for a "Did you mean..." or spell checking algorithm like the ones here or here or here (though this image from the first link seems good); I want an algorithm that will be able to suggest the best match for an incomplete word or phrase; e.g. the word "bat" should be a better match of the word "battery" than the word "car".

It would also be impractical to use Google's method of returning the few strings that are most common that start with (approximately) the same letters, because, as far as I know, each element in the list would be equally as common as any other.

Also, I would like to do this in Java (8); however, other language answers are acceptable, as long as they do not use built in functions for which Java has no equivalent. In case it is useful, I wrote a modified version of Levenshtein distance (below) which fills the search string with asterisks signifying "any character." This works for single words, e.g. "mud" is a perfect match of "muddy", but isn't good enough when considering people may use "car" to search for "race car".

/**
 * <ul>
 * <b><i>searchDistance</i></b><br>
 * <br>
 * <code>&nbsp;public static int searchDistance(String key, String match)</code><br>
 * <br>
 * Gets the Levenshtein distance between <code>key</code> and <code>match</code>. <br>
 * If <code>useAsterisk</code> is true, then the follwing applies: If <code>key</code> is shorter than <code>match</code>, the asterisk <code>'*'</code> is appended to it until the lengths are equal. Asterisks can be used in <code>key</code> to signify 'any character.'
 * @param key - The text to search for
 * @param match - The text to compare <code>key</code> against
 * @param useAsterisk - Whether or not to use asterisks for the purpose described above
 * @return the Levenshtein distance between <code>key</code> and <code>match</code>.
 *         </ul>
 */
public static int searchDistance(String key, String match, boolean useAsterisk) {
    while (key.length() < match.length()) {
        key = key + "*";
    }

    int[][] matrix = new int[key.length() + 1][match.length() + 1];

    for (int i = 0; i < matrix.length; i++) {
        matrix[i][0] = i;
    }

    for (int i = 0; i < matrix[0].length; i++) {
        matrix[0][i] = i;
    }

    for (int a = 1; a < matrix.length; a++) {
        for (int b = 1; b < matrix[0].length; b++) {
            matrix[a][b] = Math.min(Math.min(matrix[a - 1][b] + 1, matrix[a][b - 1] + 1), matrix[a - 1][b - 1] + (key.charAt(a - 1) == match.charAt(b - 1) || key.charAt(a - 1) == '*' ? 0 : 1));
        }
    }

    return matrix[matrix.length - 1][matrix[0].length - 1];
}

TL;DR: Is there a good way to give completion suggestions for search terms?

Thanks in advance!

Community
  • 1
  • 1
ricky3350
  • 1,710
  • 3
  • 20
  • 35

2 Answers2

1

Try to look at, K Shingles method In: http://infolab.stanford.edu/~ullman/mmds/book.pdf :page 77

It might give to some idea for impelenting such fuzzy search system

Masoud
  • 1,343
  • 8
  • 25
  • Looks pretty good, trying it out; however, it is still a method of comparison, not completion, and also is for documents, mot small phrases. Still could be good; thanks. – ricky3350 Jul 15 '16 at 20:08
1

There's always the simple, brute-force method. Even with a fairly large set of phrases, it can work well.

Imagine you have a list of 1 million phrases. The user enters the letter 'c'. You search your list of phrases for all of them that contain the letter 'c', and display them. You also keep that result.

The user then types 'a'. Now, you search for the string "ca" in the list of strings that was returned from the previous search. So already you've cut your search from all phrases to just those phrases you know contain the letter 'c'. Considering that about 37% of English words contain the letter 'c' (see http://phrontistery.info/ihlstats.html), you've already cut your list by almost two-thirds.

Anyway, you now have a list of phrases that contain the letters "ca". This list is going to be rather small when compared to the list of all phrases. You can continue to refine your list as the user types characters.

If the initial search of the entire list takes too long, you can easily optimize this by creating a dictionary, indexed by letter, and having a list of the words that contain that letter. So the entry for 'c', for example, would contain "race car", "car", "cat", "master carver", etc. So there's no searching involved to get the initial list.

Another benefit of using the dictionary approach is that you can preprocess the list for each letter so that words that start with the letter are at the front of the list. That's kind of nice because most of the time when somebody's searching, he's looking for a word or phrase that starts with the first letter he types. But you could easily arrange by popularity or any other criteria.

I have used this approach a number of times, and it works quite well. It's dead simple to implement, and usually performs fast enough without needing any optimization. The dictionary optimization I mentioned above was sufficient for all but a few of the cases where the simple brute force method didn't work, and one time I needed two dictionaries: one for the first character, and one for letter pairs.

Even if this turns out not to be the final solution, it's useful to have because it's easy to prove correct and to test other, more involved, algorithms against.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
  • Yes; such a method would work pretty well. However, while "ca" would give "car," "cat," and "race car," it would also give things such as "beCAuse," or "electriCAl," which is are unlikely completions. This will probably sill be a part of the final solution, and as you said, it's a good test metric. – ricky3350 Jul 15 '16 at 20:08
  • Also - my list is not _that_ long; "this is a fairly long list" only refers to the fact that it would be tedious to navigate it all as a user, especially if the user doesn't know exactly what the entry he/she is looking for is called; it's probably only about 200 or so entries long. – ricky3350 Jul 15 '16 at 20:10
  • @ricky3350: If the list is reasonably small (and 200 is pretty small), you can do a lot of preprocessing to make sure relevant things show up at the top of the list. In the "ca" case, for example, you can hand-construct the order in which items show up so "car," "cat," and "race car" show up before "because" and "electrical." – Jim Mischel Jul 19 '16 at 15:47