You can use Apache Lucene, but depending on the use case this may be too heavy weight. For very simple fuzzy searches it may be a bit complex to use and (correct me if I'm wrong) it requires you to build an index.
If you need a simple online (= not maintaining an index) algorithm you can use the fuzzy Bitap algorithm. I found an implementation in Java here. It's code fits in a single relatively short method with an almost self-explaining signature:
public static List<Integer> find(String doc, String pattern, int k)
Apache Commons StringUtils
has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version of String.equals
, Bitap is like the fuzzy version of String.indexOf
and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.
Notes:
- The Bitap algorithm seems to be mostly useful for relatively small
alphabets, e.g. plain ASCII. In fact the Simon Watiau version I linked to throws an
ArrayIndexOutOfBoundsException
on non-ASCII characters (>= 128) so you will have to filter these out.
I tried using Bimap in an application to search an in-memory list of persons by name. I found that a Levenhstein distance of 2
gives way too many false positives. A Levenhstein distance of 1 works
better, but it cannot detect a typo where you swap two letters, e.g.
"William" and "Willaim". I can think of a few ways to solve this,
e.g.
- do a fuzzy search only when an exact search finds no matches (and show a message to the user about this)
- adjust Bitap to use Damerau-Levenshtein distance where a swap has distance 1 instead of 2. According to wikipedia, this is possible, but I could not find an existing implementation in Java.
- instead of "contains" do a "startsWith". The fuzzy search tools contains a prefix version of Damerau-Levenshtein, but it gave me an
ArrayIndexOutOfBoundsException
- adjust the algorithm to introduce search result ranking where exact matches score higher
If you are going to do 2 or 4, it may
be better to use a proper full-text search library like Lucene
anyway.
- More information on fuzzy search can be found on this blog. It's author
also created an implementation in Java called
BitapOnlineSearcher
,
but requires you to use java.io.Reader
together with an Alphabet
class. It's Javadoc is written in Russian.