84

I'm looking for a high performance Java library for fuzzy string search.

There are numerous algorithms to find similar strings, Levenshtein distance, Daitch-Mokotoff Soundex, n-grams etc.

What Java implementations exists? Pros and cons for them? I'm aware of Lucene, any other solution or Lucene is best?

I found these, does anyone have experience with them?

Paul Roub
  • 36,322
  • 27
  • 84
  • 93
dario
  • 47
  • 1
  • 4
  • 5

8 Answers8

45

Commons Lang has an implementation of Levenshtein distance.

Commons Codec has an implementation of soundex and metaphone.

femtoRgon
  • 32,893
  • 7
  • 60
  • 87
JodaStephen
  • 60,927
  • 15
  • 95
  • 117
  • 4
    Can't comment on the others, but I find commons-langs Levenshtein's distance useful for a fuzzy equality check, not a fuzzy contains. Unfortunately you still have to write your own algorithm that uses it. This still takes some effort to do correctly (you have to match different lengths in the source string) and with good performance (bitap is probably much faster than what you could write using only Levenshtein distance). – Henno Vermeulen Oct 14 '15 at 06:37
  • @HennoVermeulen please can you share with us if you find any solution? any implementation of bitap in java? – hereForLearing May 22 '16 at 15:38
  • My answer to this question contains a link to a Java implementation (in fact it is the first one I find when googling for "Java bitap") – Henno Vermeulen May 23 '16 at 17:45
  • 1
    For those looking for a simple fuzzy search that actually returns the matched substring from the string and not a score, here is a gist: https://gist.github.com/shathor/8ad04d8923d6c07fd2f4a06e9543bebf . Edit: @sukhmel I've updated the link in this comment (deleted the old one). If it happens again, the gist should be available in my repo: https://gist.github.com/shathor – Terran May 14 '18 at 11:03
22

If you are mostly comparing short strings and want something portable and lightweight you can use the well known python algorithm fuzzywuzzy ported to Java.

You can read more about it here

ᴘᴀɴᴀʏɪᴏᴛɪs
  • 7,169
  • 9
  • 50
  • 81
  • 3
    Just had a very positive experience using fuzzywuzzy. Compared several strings within a collection of 250,000+ objects to a collection of 30,000. The fuzzy matching was efficient and effective, the api was user friendly. – The Gilbert Arenas Dagger Dec 30 '16 at 19:47
  • Great library, I integrated it with our current Android project and the initial results are very promising – A.Alqadomi Oct 23 '17 at 14:21
  • 3
    Imoprtant to note that both the Python and Java version are GPL licensed. – Asthor Mar 21 '19 at 08:02
  • 2
    Since I couldn't find this anywhere however much I searched, the way to import this is (once you have the jar on the classpath) - ```import me.xdrop.fuzzywuzzy.*;``` – Siddhartha Oct 11 '19 at 19:30
13

You can use Apache Lucene, but depending on the use case this may be too heavy weight. For very simple fuzzy searches it may be a bit complex to use and (correct me if I'm wrong) it requires you to build an index.

If you need a simple online (= not maintaining an index) algorithm you can use the fuzzy Bitap algorithm. I found an implementation in Java here. It's code fits in a single relatively short method with an almost self-explaining signature:

public static List<Integer> find(String doc, String pattern, int k)

Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version of String.equals, Bitap is like the fuzzy version of String.indexOf and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.

Notes:

  • The Bitap algorithm seems to be mostly useful for relatively small alphabets, e.g. plain ASCII. In fact the Simon Watiau version I linked to throws an ArrayIndexOutOfBoundsException on non-ASCII characters (>= 128) so you will have to filter these out.
  • I tried using Bimap in an application to search an in-memory list of persons by name. I found that a Levenhstein distance of 2 gives way too many false positives. A Levenhstein distance of 1 works better, but it cannot detect a typo where you swap two letters, e.g. "William" and "Willaim". I can think of a few ways to solve this, e.g.

    1. do a fuzzy search only when an exact search finds no matches (and show a message to the user about this)
    2. adjust Bitap to use Damerau-Levenshtein distance where a swap has distance 1 instead of 2. According to wikipedia, this is possible, but I could not find an existing implementation in Java.
    3. instead of "contains" do a "startsWith". The fuzzy search tools contains a prefix version of Damerau-Levenshtein, but it gave me an ArrayIndexOutOfBoundsException
    4. adjust the algorithm to introduce search result ranking where exact matches score higher

    If you are going to do 2 or 4, it may be better to use a proper full-text search library like Lucene anyway.

  • More information on fuzzy search can be found on this blog. It's author also created an implementation in Java called BitapOnlineSearcher, but requires you to use java.io.Reader together with an Alphabet class. It's Javadoc is written in Russian.
Henno Vermeulen
  • 1,495
  • 2
  • 15
  • 25
  • is there a way to make Bitap searchs only for words with the same number of letters for example if I search Name with k=2 Namo and Mamo are accepted but not Nam ? – hereForLearing May 22 '16 at 20:41
10

SimMetrics is probably what you need: http://sourceforge.net/projects/simmetrics/

It has several algorithms for calculating various flavours of edit-distance.

Lucene is a very powerful full-text search engine, but FT search isn't exactly the same thing as fuzzy string matching (eg. given a list of strings find me the one that is most similar to some candidate string).

Javid Jamae
  • 8,741
  • 4
  • 47
  • 62
Darren
  • 2,888
  • 1
  • 23
  • 34
5

To Lucene I would add SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Mond Raymond
  • 320
  • 1
  • 4
  • 9
3

You can try the Completely library, it relies on text preprocessing to create an in-memory index for efficiently answering (fuzzy) searches in large data sets. Unlike Lucene and other full featured text search libraries, the API is small and easy to get started.

Filipe Miguel Fonseca
  • 6,400
  • 1
  • 31
  • 26
2

Apache Lucene is the only way, I think. I don't know any better search lib.

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Seki
  • 11,135
  • 7
  • 46
  • 70
Vugluskr
  • 1,426
  • 1
  • 12
  • 12
1

You can try bitap. I was playing with bitap written in ANSI C and it was pretty fast there is java implementation in http://www.crosswire.org.

Mojo Risin
  • 8,136
  • 5
  • 45
  • 58