I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd" (but possibly much longer), and I have a collection of lots of little strings. I want to count how many times the little strings are found in the big string (overlapping matches are OK). I care only about speed. KMP seemed good, and while Rabin-Karp handles multiple patterns, it looked slow.
-
I believe this question might help you a little bit: http://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm -- the link at the end of that question is really interesting :-) – Natan Streppel Aug 03 '13 at 17:24
-
@HappyYellowFace Unfortunately none of that applies to my specific question – MyNameIsKhan Aug 03 '13 at 17:29
-
KMP is probably the best option for you unless you want to dive into the probability space. – gifnoc-gkp Aug 03 '13 at 17:34
-
@TheOtherGuy KMP not fast enough – MyNameIsKhan Aug 03 '13 at 17:35
-
Can you provide example sizes for the length of the big string, size of alphabet, length of small strings, and number of small strings? (I am interested because if the small strings are small enough, it may be possible to build up a histogram of counts for all possible small strings) – Peter de Rivaz Aug 03 '13 at 17:55
-
@PeterdeRivaz The small strings don't typically exceed a few hundred characters in length, but the big ones can be tens of thousands of characters long. The alphabet is upper- and lowercase letters plus the space character. – MyNameIsKhan Aug 03 '13 at 18:00
-
Your task is very much like [short read alignment](http://ai.stanford.edu/~serafim/CS374_2011/presentations/lecture10.pdf) in bioinformatics, if your big string is not that repetitive. – lcn Aug 06 '13 at 04:08
5 Answers
The problem with most string searching algorithms is that they take at least time O(k) to return k matches, so if you have, say, a string of 1 million "a"s and 1 million queries of the little string "a", then it will take around a million million (10^12) iterations to count all the matches!
An alternative linear time approach would be to:
- Construct a suffix tree of the big string: O(n) where n is len(big string)
- Precompute the number of suffixes below each node in the suffix tree: O(n)
- For each small string, walk down from the root to find the node (or the position along an edge) that the small string leads to: O(m) where m is len(small string)
- Add to the total count the number of suffixes below that node. (Each suffix corresponds to a different match of the small string in the big string)
This will take time O(n+p) where n is the length of the big string, and p is the total length of all the small strings.
Example Code
As requested, here is some small(ish) example code in Python that uses this approach:
class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T = ''
        self.E = {}
        self.nodes = [-1]  # 0th node is the empty string; nodes[] also stores suffix links

    def add(self, s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.

        Produces an edge graph keyed by (node, character) that gives (first, last, end).
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin, first, last = 0, len(self.T), len(self.T) - 1  # the "active point"
        self.T += s
        nc = len(self.nodes)
        self.nodes += [-1] * (2 * len(s))
        T = self.T
        E = self.E
        nodes = self.nodes
        Lm1 = len(T) - 1
        for last_char_index in range(first, len(T)):  # Ukkonen's algorithm: one character at a time
            c = T[last_char_index]
            last_parent_node = -1
            while True:
                parent_node = origin
                if first > last:
                    # Active point is exactly at a node: done if it already has a c-edge
                    if (origin, c) in E:
                        break
                else:
                    # Active point is partway along an edge: done if the next
                    # character on the edge is c, otherwise split the edge
                    key = origin, T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first + span
                    m = T[A + 1]
                    if m == c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc, m] = (A + 1, edge_last, edge_end)
                    parent_node = nc
                    nc += 1
                # Add a new leaf edge for c (leaf edges run to the end of T)
                E[parent_node, c] = (last_char_index, Lm1, nc)
                nc += 1
                if last_parent_node > 0:
                    nodes[last_parent_node] = parent_node  # set the suffix link
                last_parent_node = parent_node
                # Move on to the next shorter suffix
                if origin == 0:
                    first += 1
                else:
                    origin = nodes[origin]
                # Canonicalize: walk down so origin is the closest node above the active point
                if first <= last:
                    edge_first, edge_last, edge_end = E[origin, T[first]]
                    span = edge_last - edge_first
                    while span <= last - first:
                        first += span + 1
                        origin = edge_end
                        if first <= last:
                            edge_first, edge_last, edge_end = E[origin, T[first]]
                            span = edge_last - edge_first
            if last_parent_node > 0:
                nodes[last_parent_node] = parent_node
            last += 1
            # Canonicalize again after extending the active point by c
            if first <= last:
                edge_first, edge_last, edge_end = E[origin, T[first]]
                span = edge_last - edge_first
                while span <= last - first:
                    first += span + 1
                    origin = edge_end
                    if first <= last:
                        edge_first, edge_last, edge_end = E[origin, T[first]]
                        span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in range(len(self.nodes))]  # contains the set of choices for each node
        for (origin, c), edge in self.E.items():
            choices[origin].append(c)
        choices = [sorted(s) for s in choices]  # should not have any repeats by construction
        self.choices = choices
        return choices

    def count_suffixes(self, term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings are assumed to use term as the terminating character.
        (For very long strings you may need to raise sys.setrecursionlimit.)"""
        C = self.suffix_counts = [0] * len(self.nodes)
        choices = self.make_choices()

        def f(node=0):
            t = 0
            X = choices[node]
            if len(X) == 0:
                t += 1  # this node is a leaf node
            else:
                for c in X:
                    if c == term:
                        t += 1
                        continue
                    first, last, end = self.E[node, c]
                    t += f(end)
            C[node] = t
            return t

        return f()

    def count_matches(self, needle):
        """Return the count of matches for this needle in the suffix tree"""
        i = 0
        node = 0
        E = self.E
        T = self.T
        while i < len(needle):
            c = needle[i]
            key = node, c
            if key not in E:
                return 0
            first, last, node = E[key]
            # Compare the needle against the characters along this edge
            while i < len(needle) and first <= last:
                if needle[i] != T[first]:
                    return 0
                i += 1
                first += 1
        return self.suffix_counts[node]

big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings = ["a", "ab", "abc"]
s = SuffixTree()
term = chr(0)  # unique terminator so every suffix ends at a leaf
s.add(big + term)
s.count_suffixes(term)
for needle in small_strings:
    x = s.count_matches(needle)
    print(needle, 'has', x, 'matches')
It prints:
a has 11 matches
ab has 1 matches
abc has 0 matches
However, in practice I would recommend you simply use a pre-existing Aho-Corasick implementation as I would expect this to be much faster in your particular case.
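If you do go the pre-built route, a minimal sketch using the third-party pyahocorasick package might look like this (my example, not part of the answer; it assumes pip install pyahocorasick):

import ahocorasick

def count_all(big, small_strings):
    A = ahocorasick.Automaton()
    for i, s in enumerate(small_strings):
        A.add_word(s, i)             # store the pattern's index as its payload
    A.make_automaton()
    counts = [0] * len(small_strings)
    for _end, i in A.iter(big):      # yields every match, overlaps included
        counts[i] += 1
    return counts

print(count_all("aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd", ["a", "ab", "abc"]))
# expected output: [11, 1, 0]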

-
I think this answer is a little too complicated for me. I don't know why it's so hard to find information on this subject online – MyNameIsKhan Aug 03 '13 at 17:49
-
@AgainstASicilian: It's not hard to find this information online. Here is a good place to start: [Exact String Matching Algorithms](http://www-igm.univ-mlv.fr/~lecroq/string/index.html) – Blastfurnace Aug 03 '13 at 18:19
-
@Blastfurnace That is actually the exact reference I've been using; unfortunately, none of them apply to my specific case – MyNameIsKhan Aug 03 '13 at 18:29
-
@PeterdeRivaz Can you give a small example of your suffix explanation? – MyNameIsKhan Aug 03 '13 at 18:56
-
@AgainstASicilian: You said you only care about speed. In that case, you will need to dive deeper into the subject: simple brute-force algorithms will not be efficient. The answers given here are not complicated; just take your time to understand the details. – Zane Aug 04 '13 at 16:31
Matching against a large collection of strings sounds like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to me. It does find matches one at a time, so Peter de Rivaz's idea might be better if there are a huge number of matches. On the other hand, Aho-Corasick doesn't need to keep the big string in memory - you can just stream it through - and it is very practical to implement and tune - the Wikipedia link notes that the original fgrep used it.
Thinking about it, you can work around the mega-match problem. Aho-Corasick can be viewed as creating a deterministic finite state machine just capable of recognizing each of the strings it is searching for. The state of the machine corresponds to the last N characters seen. If you wish to match two strings and one is a suffix of the other, you need to be careful that when you are in the state that says you have just matched the longer string, you also recognize that this means you have just matched the shorter string. If you deliberately choose not to do this, then the counts you accumulate for the shorter string will be incorrect - but you know exactly how much they are too low by: the number of times the longer string was seen. So if you modify Aho-Corasick so that only the longest match at each point is recognized and counted, the cost of matching remains linear in the number of characters of the string you are searching, and you can fix up the counts at the end by going through the long strings and incrementing the counts for the shorter strings which are suffixes of the long strings. This will take time at most linear in the total size of the strings being searched for.
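Here is a rough sketch of that fix-up step (mine, not the answer's). It assumes you already have raw counts from a "longest match only" pass; the quadratic loop is for clarity, and a linear version would walk the automaton's suffix links instead:

def fix_up_counts(raw_counts):
    """raw_counts maps pattern -> number of positions where it was the longest match.
    A pattern truly ends wherever it, or any longer pattern having it as a suffix,
    was the longest match, so sum the raw counts of all such patterns."""
    return {p: sum(c for q, c in raw_counts.items() if q.endswith(p))
            for p in raw_counts}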

-
I actually looked into Aho-Corasick but could not find an implementation that worked the way I wanted it to (lots of substrings, one big main string) – MyNameIsKhan Aug 03 '13 at 17:46
-
+1: Given the constraints I think this approach will work really well, as there are unlikely to be too many multiple matches. There is C code [here](http://sourceforge.net/projects/multifast/files/multifast-v1.0.0/) that looks like it might do the job. – Peter de Rivaz Aug 03 '13 at 18:19
-
@AgainstASicilian: Aho-Corasick works quite well with "lots of substrings, one big main string," although it's typically used for searching *many* large strings, because building the state machine takes a little time. If your little strings change frequently and your main string is static, then Aho-Corasick probably isn't the algorithm for you. – Jim Mischel Aug 04 '13 at 16:25
Building on another answer (and hopefully convincing you that this is the best type of answer): look up http://en.wikipedia.org/wiki/Suffix_tree and go through the references listed there to learn about suffix trees if you really want the fastest solution to your problem. A suffix tree can also give you the number of matches without iterating over all the matches, and the running times and memory requirements you get are the absolute best possible for any substring matching or match counting algorithm. Once you understand how the suffix tree works and how to build and use it, the only additional tweak you need is to store, at each internal node of the tree, the number of distinct string starting positions represented below it. This is a minor modification that you can do efficiently (in linear time, as already claimed) by recursively getting the counts from child nodes and adding them up. Those counts then allow you to answer each count query without enumerating the individual matches.

-
I have no idea how to do any of this even when I read the wiki. Seriously, is there any actually useful resource on this type of problem? – MyNameIsKhan Aug 03 '13 at 19:29
-
Try these slides: www.cs.uku.fi/~kilpelai/BSA05/lectures/slides06.pdf – user2566092 Aug 03 '13 at 19:44
-
@AgainstASicilian Suffix tree construction algorithms are tricky. Suffix arrays (http://en.wikipedia.org/wiki/Suffix_array) can do many of the same things and have more comprehensible linear-time construction algorithms - see e.g. DC3/skew. In this case you should be able to count occurrences of X by finding the first and last occurrences of X... in the array. – mcdowella Aug 05 '13 at 04:29
1) I'd go with finite automata. I can't think of a specialized library right now, but the general-purpose PCRE can be used to construct an automaton that efficiently searches for the given substrings. For the substrings "foo" and "bar" one can construct the pattern /(foo)|(bar)/, scan the string, and get the "id" number of the matched substring by iterating over the ovector and checking which group matched (a rough illustration follows below).
RE2::FindAndConsume is good if you only need the total count, not grouping by substring.
P.S. Example using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I've had a good time creating a Ragel machine for a similar task. For a small set of searched strings a "normal" DFA would work, but if you have a larger ruleset then using Ragel scanners shows good results (here is a related answer).
P.P.P.S. PCRE has the MARK keyword, which is super useful for that kind of subpattern classification (cf).
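As a rough illustration of the alternation idea in (1), here is a sketch using Python's re module in place of PCRE (my substitution, not the answer's; note that re reports only non-overlapping matches):

import re

patterns = ["foo", "bar"]
# One capturing group per pattern, so m.lastindex identifies which one matched.
rx = re.compile("|".join("(" + re.escape(p) + ")" for p in patterns))

def count_by_pattern(haystack):
    counts = [0] * len(patterns)
    for m in rx.finditer(haystack):  # leftmost, non-overlapping matches
        counts[m.lastindex - 1] += 1
    return counts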
2) Quite some time ago I wrote a Trie-based thingie in Scala for that kind of load: https://gist.github.com/ArtemGr/6150594; Trie.search goes over the string, trying to match the current position to a number encoded in the Trie. The trie is encoded in a single cache-friendly array; I expect it to be as good as non-JIT DFAs.
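For readers who don't read Scala, here is a rough Python sketch of the same trie-scan idea (the dict-based trie and all names are mine, not the gist's):

def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[None] = p  # end-marker: a complete pattern finishes at this node
    return root

def count_with_trie(big, patterns):
    """Counts overlapping matches; O(len(big) * longest_pattern) worst case."""
    root = build_trie(patterns)
    counts = {p: 0 for p in patterns}
    for start in range(len(big)):  # try to extend a match from every offset
        node = root
        for ch in big[start:]:
            node = node.get(ch)
            if node is None:
                break
            if None in node:
                counts[node[None]] += 1
    return counts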
3) I've been using boost::spirit for substring matching, but never got around to measuring how it fares against other approaches. Spirit uses an efficient structure for the symbols matching; perhaps that structure could be used on its own, without the overhead of Spirit.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

namespace qi = boost::spirit::qi;  // alias assumed by the qi:: names below
using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1;
using boost::phoenix::ref;

// Table of the searched-for substrings, each mapped to an integer id.
static struct exact_t : qi::symbols<char, int> {
    exact_t() {
        add("foo", 1)("bar", 2);
    }
} exact;

int32_t match = -1;
qi::rule<const char*, int()> rule = +alnum >> exact[ref(match) = _1];
const char* it = haystack;  // Mutable iterator for Spirit.
qi::parse(it, haystack + haystackLen, rule);
If I understood correctly, your input string consists of long blocks where a single character is repeated. In this case, you can compress your text using run-length encoding.
For example:
s = aaabbbbcc
is encoded as:
encoded_s = (a3)(b4)(c2)
Now you may search for patterns in encoded text.
If you want a concrete algorithm, just search the web for pattern matching in run-length encoded strings. You can achieve time complexity O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both M and N are in general much smaller than the original lengths, so this beats any standard pattern matching algorithm, e.g. KMP.
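As a starting point (my sketch, not the answer's), run-length encoding is a one-liner in Python, and counting a single-character pattern such as "aaa" against the encoded form is simple arithmetic; matching general multi-run patterns needs the algorithms mentioned above:

from itertools import groupby

def rle(s):
    """'aaabbbbcc' -> [('a', 3), ('b', 4), ('c', 2)]"""
    return [(ch, len(list(g))) for ch, g in groupby(s)]

def count_single_run(encoded, ch, k):
    """Overlapping occurrences of ch*k: a run of length n contains n-k+1 of them."""
    return sum(max(0, n - k + 1) for c, n in encoded if c == ch)

enc = rle("aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd")
print(count_single_run(enc, "a", 3))  # 9: eleven a's contain nine "aaa"s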
