Is there an efficient algorithm to extract from an array of string only those which contain a specific substring?

Question

Imagine there is a big array of Strings S. From that array I need to get only those strings which contain a specific substring. For example if my array is String s [] = {"hello world", "back to hell", "say hello world"}; and my keyword is "hello", then it should return me first and last element. I tried using KMP and Boyer-Moor algorithms to check each string in array whether it contains a substring or not, but it takes too much time. Then I learned about Aho-Corasick algorithm. I am still looking it up, but seemingly it needs an array of substrings and one big string to match,while what I want is exactly opposite. So I was looking for a suggestions on how to modify Aho-Corasick algorithm for my purposes, or another means to achieve those. Would be thankful for any suggestions.

Is this an operation that you will do once, or many times? In other words, is it OK to have an expensive operation to build a data structure that can then be used for this search many times? — btilly, Dec 23 '19 at 20:37

jrook · Accepted Answer · 2019-12-23T23:20:39.127

Build a suffix tree using Ukkonen algorithm or the one suggested in this source(PDF):

McCreight’s algorithm can be easily adapted to build the generalized suffix tree for a set S={s1, s2, . . . , s_k} of strings of total length N in O(N) time...

Then use the created suffix tree to search for a given pattern. The problem is to find all occurrences of pattern P (length m) in a suffix tree T. According to the above source:

The pattern matching problem can be solved in optimal O(m+k) time ..., where k is the number of occurrences of P in T

Note that the length of the text (or the number of strings in the array) does not affect search efficiency. Therefore, you could pay for constructing a suffix tree once and then use it many times to search efficiently for short pattern strings.

EDIT: if you are in a hurry and do not mind a little bit of extra time complexity, you can construct suffix arrays instead of suffix trees using this approach(PDF) in just O(n*log^2(n)) with a very small piece of code. Here is the core idea of this approach:

The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2^k long prefixes.

And here is the pseudo-code reproduced from the above source:

n ←length(T) 
  for i←0 : n – 1
    P(0, i)← position of T(i) in the ordered array of T‘s characters 
cnt ← 1 
for k←1 : [log2n] (ceil)
  for i←0 : n – 1
    L(i)← (P(k – 1, i), P(k – 1, i + cnt), i)             
    sort L
    compute P(k, i) , i = 0, n - 1 
    cnt←2 * cnt

After running this code, P will contain the suffix array. Search using this approach is also straightforward:

Since the suffix array offers the order of T’s suffixes, searching a string P into T is easily done with a binary search. Since comparing is done in O(|P|)

My answer was written independently of this and is essentially the same answer except that instead of an external link and "...can easily be adapted to..." I provided running proof of concept code. — btilly, Dec 23 '19 at 22:43
@btilly I think an external link and a general explanation is sufficient to answer the question in this case as the OP has not also gone into much detail and has not posted any code sample. That being said, there is nothing wrong with a more detailed answer that includes code. — jrook, Dec 23 '19 at 22:47
External links as answers are discouraged because of link rot. Also the "can be easily adapted to" is a little misleading. The adaptation is conceptually more complex and only straightforward if you have a deep understanding of the original algorithm. I wouldn't expect most programmers to be able to understand the algorithm well enough to do the adaptation. — btilly, Dec 23 '19 at 23:00
@btilly this is not a *link-only* answer by any means. The core idea, time complexities, and the name of the algorithms have been given so the OP can search more and find what works best for them. I agree with your point about *"can be easily adapted"*. It is part of the quote in the source and not my own opinion. — jrook, Dec 23 '19 at 23:06
@btilly IMHO, there are many tutorials and guides for construction and using suffix trees and I do not find it necessary to reproduce those algorithms in an answer to a general question. However I find no problem with an answer that does so. — jrook, Dec 23 '19 at 23:09

Is there an efficient algorithm to extract from an array of string only those which contain a specific substring?

1 Answers1