Recently I have been working on DNA sequence matching algorithms and their comparisons. I have implemented standard Naive, KMP, and Robin-Karp algos for the same purpose.
After executing in Java (8Gb RAM, Intel I5 processor, 1GB hard disk), I noted that naive works faster than KMP and RK.
But I got astonished after knowing that for DNA sequences up to 100,000 characters and pattern of 4 characters, naive(6ms) still outperforms KMP(11ms) and RK(17ms). I am confused as to why so this is happening and how can this be possible?
Does naive works really that fast or JVM is throwing some random garbage values, or am I placing the time instances of Java at the wrong places?
Any help is much appreciated.

- 29
- 6
-
How are you measuring performance? (But that said: it's not implausible that naive is faster. Java's built-in `String.indexOf` uses the naive algorithm because it works better most of the time in practice.) – Louis Wasserman Nov 18 '20 at 19:34
-
The performance is measured using System.currentMilliSeconds(). The code using the standard charAt() function to match the Strings. – anony_std Nov 18 '20 at 19:37
-
https://stackoverflow.com/q/504103/869736 explains how that way of measuring performance can lead to very inaccurate results. – Louis Wasserman Nov 18 '20 at 19:42
1 Answers
There are a number of factors that might be contributing to this. Here's a few things to think about:
A four-character search string is pretty short - in fact, that's so small that a naive search would likely be extremely fast. The reason that KMP and Rabin-Karp are considered "fast" string searching algorithms is that they scan each character of the input strings, on average, at most a constant number of times. If you have a four-character string, you'll also scan each character of the input at most a constant number of times as well, and it's a low constant (4). So this may simply be the constant factor terms from the KMP and Rabin-Karp outweighing the cost of a naive search. (This is similar to why many sorting algorithms switch to insertion sort for small input sizes - while insertion sort is worse for large sequences, it's much, much faster than the "fast" sorting algorithms on small inputs.) You may want to mix things up a bit and try different string lengths, nonrandom inputs, etc.
With a four-character sequence drawn from a genome there are 44 = 256 possible combinations of random strings to search for. Therefore, on expectation you'll find that sequence after reading at most 256 four-character blocks from the string being searched. That means that you'd need, on expectation, at most 1024 characters to be read in order to find your string, so the fact that the genomes are 100,000 characters long is likely irrelevant. You probably are more accurately dealing with "effective" string lengths closer to 1000, and, like in part (1), if you have smaller inputs to the algorithms the benefits of KMP and Rabin-Karp diminish relative to their increased constant factors.
This could also be an artifact of how the code is implemented. Which versions of KMP and Rabin-Karp are you using?
String.indexOf
is likely heavily optimized by the JVM implementers; are you similarly using highly-optimized versions of KMP and Rabin-Karp?

- 30,962
- 25
- 85
- 135

- 362,284
- 104
- 897
- 1,065
-
The implementation of KMP follows from here: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/ `String.indexOf` is nowhere used in the implementation. In the loops `string.charAt is used for the comparison purpose. – anony_std Nov 18 '20 at 20:19