Efficient Data Structure For Substring Search?

Question

Assume I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purpose of this question substring includes equality, e.g. "foo" is a substring of "foo".) For example assume the function that does what I want is called anySubstring:

S = ["foo", "baz"]
q = "foobar"
assert anySubstring(S, q)  # "foo" is a substring of "foobar"

S = ["waldo", "baz"]
assert not anySubstring(S, q)

Is there any easy-to-implement algorithm to do this with time complexity sublinear in len(S)? It's ok if S has to be processed into some clever data structure first because I will be querying each S with a lot of q strings, so the amortized cost of this preprocessing might be reasonable.

EDIT: To clarify, I don't care which member of S is a substring of q, only whether at least one is. In other words, I only care about a boolean answer.

Are the strings in S short? How does the length of the longest string in S compare with the length of S? — aelguindy, Mar 09 '12 at 16:04

score 15 · Answer 1 · answered Mar 09 '12 at 16:15

15

I think Aho-Corasick algorithm does what you want. I think there is another solution which is very simple to implement, it's Karp-Rabin algorithm.

answered Mar 09 '12 at 16:15

aelguindy

3,703
24
31

1

I cannot believe I am the only person to upvote the right answer. – Nemo Mar 10 '12 at 16:23
And there's a nice #Python library for Aho-Corasick https://pypi.org/project/pyahocorasick/ ! :-D very useful answer! – alisa Aug 06 '19 at 21:10

score 2 · Answer 2 · answered Mar 09 '12 at 15:25

2

So if the length of S is way less then the sum of the lengths of the potential substrings your best option would be to build a suffix tree from S and then do a search in it. This is linear with respect to the length of S plus the summar length of the candidate substrings. Of course there can not be an algorithm with better complexity as you have to pass through all the input at least. If the case is opposite i.e. the length of s is more then the summar length of the substrings your best option would be aho-corasick.

Hope this helps.

answered Mar 09 '12 at 15:25

Ivaylo Strandjev

69,226
18
123
176

It fits for the other way around - querying many times with a single `q`, but in here - the `S` is constant, and `q` is changing... – amit Mar 09 '12 at 15:29
2

@amit I think Aho-Corasick does what OP wants.. It operates much like your solution except that the trie has failure transitions that take you to the right position in the automaton. – aelguindy Mar 09 '12 at 16:08
@aelguindy: I was speaking only about the suffix tree approach, I am afraid I am not familiar with the mentioned algorithm, but I will read how it works later today. – amit Mar 09 '12 at 16:11
@amit Yes, you are right regarding your comment about the suffix tree, I though you're commenting on the entire answer, sorry :-). – aelguindy Mar 10 '12 at 17:08

score 1 · Answer 3 · answered Mar 09 '12 at 22:23

1

Create a regular expression .*(S1|S2|...|Sn).* and construct its minimal DFA.

Run your query string through the DFA.

answered Mar 09 '12 at 22:23

Dimple Kapadia

27
1

@SaeedAmiri If OP does not care at all about the efficiency of the preprocessing, then once the DFA is created, all queries run in time linear in the length of q, independent of any aspect of S. – aelguindy Mar 10 '12 at 17:07
@aelguindy, suffix tree works in similar Order with less constant factor and less memory, in fact you should convert your DFA finally to something like trie. – Saeed Amiri Mar 10 '12 at 17:36
@SaeedAmiri How do you check that your query string has the suffix tree string as a substring? Suffix trees should be built when the text is fixed not the patterns.. – aelguindy Mar 10 '12 at 20:26

Efficient Data Structure For Substring Search?

3 Answers3

Linked

Related