Optimizing construction of a trie over all substrings

Question

I am solving a trie related problem. There is a set of strings S. I have to create a trie over all substrings for each string in S. I am using the following routine:

String strings[] = { ... }; // array containing all strings
for(int i = 0; i < strings.length; i++) {
    String w = strings[i];
    for (int j = 0; j < w.length(); j++) {
        for (int k = j + 1; k <= w.length(); k++) {
            trie.insert(w.substring(j, k));
        }
    }
}

I am using the trie implementation provided here. However, I am wondering if there are certain optimizations which can be done in order to reduce the complexity of creating trie over all substrings?

Why do I need this? Because I am trying to solve this problem.

kajacx · Answer 1 · 2015-07-01T17:39:17.313

If we have N words, each with maximum length L, your algorithm will take O(N*L^3) (supposing that adding to trie is linear with length of adding word). However, the size of the resulting trie (number of nodes) is at most O(N*L^2), so it seems you are wasting time and you could do better.

And indeed you can, but you have to pull a few tricks from you sleeve. Also, you will no longer need the trie.

.substring() in constant time

In Java 7, each String had a backing char[] array as well as starting position and length. This allowed the .substring() method to run in constant time, since String is immutable class. New String object with same backing char[] array was created, only with different start position and length.

You will need to extend this a bit, to support adding at the end of the string, by increasing the length. Always create a new string object, but leave the backing array same.

Recompute hash in constant time after appending single character

Again, let me use Java's hashCode() function for String:

int hash = 0;
for (int i = 0; i < data.length; i++) {
    hash = 31 * hash + data[i];
} // data is the backing array

Now, how will the hash change after adding a single character at the end of the word? Easy, just add it's value (ASCII code) multiplied by 31^length. You can keep powers of 31 in some separate table, other primes can be used as well.

Store all substring in single HashMap

With using tricks 1 and 2, you can generate all substrings in time O(N*L^2), which is the total number of substrings. Just always start with string of length one and add one character at a time. Put all your strings into a single HashMap, to reduce duplicities.

(You can skip 2 and 3 and discard duplicities when/after sorting, perhaps it will be even faster.)

Sort your substrings and you are good to go.

Well, when I got to point 4, I realized my plan wouldn't work, because in sorting you need to compare strings, and that can take O(L) time. I came up with several attempts to solve it, among them bucket sorting, but none would be faster than original O(N*L^3)

I will just this answer here in case it inspires someone.

In case you don't know Aho-Corasic algorithm, take look into that, it could have some use for your problem.

I have heard of the algorithm, I will get back to you after reading it. I am not sure if it is a good fit for the problem. — Bhoot, Jul 01 '15 at 16:53
I just realized this answer is wrong, wait a minute until I rewrite it. — kajacx, Jul 01 '15 at 17:00

score 2 · Answer 2 · answered Jul 03 '15 at 13:02

What you need may be suffix automaton. It costs only O(n) time and can recognize all substrings.

Suffix array can also solve this problems.

These two algorithms can solve most string problems, and they are really hard to learn. After you learn those you will solve it.

score 1 · Answer 3 · answered Jul 01 '15 at 16:35

1

You may consider the following optimization:

Maintain list of processed substrings. While inserting a substring, check if the processed set contains that particular substring and if yes, skip inserting that substring in the trie.

However, the worst case complexity for insertion of all substrings in trie will be of the order of n^2 where n is the size of strings array. From the problem page, this works out to be of the order of 10^8 insertion operations in trie. Therefore, even if each insertion takes 10 operations on an average, you will have 10^9 operations in total which sets you up to exceed the time limit.

The problem page refers to LCP array as a related topic for the problem. You should consider change in approach.

answered Jul 01 '15 at 16:35

n00bc0d3r

275
2
10

1

I am not being condescending but considering your handle, this answer is pretty ironic. :) – Bhoot Jul 01 '15 at 16:37
@Bhoot: Haha! No offense taken. – n00bc0d3r Jul 01 '15 at 16:38
I would suggest to implement the set of added substrings as some sort of `HashSet`, since you can recompute the hash for string when adding or removing a single letter in constant time. – kajacx Jul 01 '15 at 16:53

score 1 · Answer 4 · edited May 23 '17 at 11:45

First, notice that it is enough to add only suffixes to the trie, and nodes for every substring will be added along the way.

Second, you have to compress the trie, otherwise it will not fit into memory limit imposed by HackerRank. Also this will make your solution faster.

I just submitted my solution implementing these suggestions, and it was accepted. (the max execution time was 0.08 seconds.)

But you can make your solution even faster by implementing a linear time algorithm to construct the suffix tree. You can read about linear time suffix tree construction algorithms here and here. There is also an explanation of the Ukkonen's algorithm on StackOverflow here.

Optimizing construction of a trie over all substrings

4 Answers4