Approximate regular expression library for Java?

Question

I have just written some code for approximate string matching. I would like to benchmark my naive algorithm against a more mature implementation running on the JVM. Any suggestions?

Is "approximate" a technical term I've never heard of, or is your code just "approximately correct" ? — Carl Smotricz, Jul 17 '10 at 20:26
If this is for bioinformatics, you're reinventing some complicated wheels. — msw, Jul 17 '10 at 20:31
@Carl Smotricz By approximate I mean to say that the two strings are approximately matched if they are within a certain edit distance of one another. See, for example, http://en.wikipedia.org/wiki/Approximate_string_matching. — Gabriel Mitchell, Jul 17 '10 at 20:36
@msw- Yes, I am trying to learn some things about biological sequence alignment by attacking some canonical (and fun) problems. I don't expect my hack will be competitive with state of the art alignment programs. However, I am curious to see how badly it fails. At any rate, if you do know of a good Java library, I would love to hear about it. — Gabriel Mitchell, Jul 17 '10 at 20:42
This has been an educational experience for me, thanks! I've tried to include at least a little bit of useful information in my answer. — Carl Smotricz, Jul 17 '10 at 21:31

score 2 · Accepted Answer · edited May 23 '17 at 10:27

2

I found these answers elsewhere on this site for similar problems.

Commons Lang has an implementation of Levenshtein distance.
http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringUtils.html

Commons Codec has an implementation of soundex and metaphone.
http://commons.apache.org/codec/api-release/org/apache/commons/codec/language/Soundex.html
http://commons.apache.org/codec/api-release/org/apache/commons/codec/language/Metaphone.html

(source)

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.

(source: zarawesome)

edited May 23 '17 at 10:27

Community

1
1

answered Jul 17 '10 at 20:46

Gunslinger47

7,001
2
21
29

Thanks, Gunslinger. The Levenshtein distance from StringUtils will be useful for a direct performance comparison. – Gabriel Mitchell Jul 17 '10 at 22:06

score 0 · Answer 2 · answered Jul 17 '10 at 21:27

It so happens I reinvented this wheel many years ago - in a FORTRAN program on a mainframe :)

When I proudly told other people on the Internet about my algorithm, they laughed and pointed me at the two (four?) big names in this area:

Needleman-Wunsch and
Smith-Waterman.

These are algorithms for comparing huge sequences of similar strings. Memory requirement is about m + n, where m and n are the sizes of the strings, and runtime is about m * n.

Gunslinger47 mentions Levenshtein, Soundex and Metaphone. Levenshtein is also a powerful means of computing string distances, but it's better suited for "normal" text. Soundex and Metaphone compute a short string intended to encode the sound of the string when spoken by a human... they become ineffective after about 3 syllables, they're really intended for words in human language rather than strings of genomes or such.

EDIT

Oops, I just found my 4 big names at the bottom of the article you cited. So you're already aware of them. I think that if you search for those names and "Java" should find you implementations. Here's the first one I found.

Approximate regular expression library for Java?

2 Answers2

Linked