InterviewStreet Find Strings

Question

You are given 'n' strings w1, w2, ......, wn. Let Si denote the set of strings formed by considering all unique substrings of the string wi. A substring is defined as a contiguous sequence of one or more characters in the string. More information on substrings can be found here. Let 'S' = {S1 U S2 U .... Sn} .i.e 'S' is a set of strings formed by considering all the unique strings in all sets S1, S2, ..... Sn. You will be given many queries and for each query, you will be given an integer 'k'. Your task is to output the lexicographically kth smallest string from the set 'S'.

Input:

The first line of input contains a single integer 'n', denoting the number of strings. Each of the next 'n' lines consists of a string. The string on the ith line (1<=i<=n) is denoted by wi and has a length mi. The next line consists of a single integer 'q', denoting the number of queries. Each of the next 'q' lines consists of a single integer 'k'.

Note: The input strings consist only of lowercase english alphabets 'a' - 'z'.

Output:

Output 'q' lines, where the ith line consists of a string which is the answer to the ith query. If the input is invalid ('k' > |S|), output "INVALID" (quotes for clarity) for that case.

Constraints:

1<=n<=50
1<=mi<=2000
1<=q<=500
1<=k<=1000000000

https://www.interviewstreet.com/challenges/dashboard/#problem/4efa210eb70ac

My approach

For each input string, generate its substrings and add them to a set, which will automatically eliminate duplicates and keep them sorted.Return element at index i in set.

I have written a code on above approach here:

http://justprogrammng.blogspot.com/2012/06/interviewstreet-find-strings-solution-c.html

But Problem I am facing is

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

This error is appearing in few test cases. Can someone please tell me why I am getting this error and how should I rectify this error?

That means you ran out of heap memory. Haven't read it thoroughly yet, but it looks like you might be storing a hell of a lot of strings, if you store every possible substring. — BoBTFish, Jun 29 '12 at 14:23
one need to store these substrings somewhere(or set) if one has to finally print the string at index i from lexicographic ordered set of substrings — sachin, Jun 29 '12 at 14:33
Kinda busy at the moment I'm afraid, don't have time to play with this. How many words are in the testcase? For each query, you can probably work out how many letters it will have, and just generate the substrings of that length, sort those, and pick the answer from them? Not sure if this is viable. — BoBTFish, Jun 29 '12 at 14:37
It looks like you generate all substrings of each string. That is `O(length^2)` memory for each string, so probably way too much. — IVlad, Jun 29 '12 at 14:56
@IVlad so you think that it's substring generation which is causing out of memory error? What do you think about set which stores stings, may be that is causing error as BoBTFish pointed out? — sachin, Jun 29 '12 at 15:06
Yes, this is what is causing your error: `for(j=0;j0;k--) { temp = s1.substr(j,k); s.insert(temp); }` — IVlad, Jun 29 '12 at 15:08
why is no one giving good suggestion, what shud i do to fix the error? — sachin, Jun 29 '12 at 15:34
If I find any spare time this evening or tomorrow this looks quite interesting. My first thoughts turn out to be wrong (i.e. my previous suggestion about only generating those of the correct length can't take into account ignoring repeats.) — BoBTFish, Jun 29 '12 at 15:36
Adding substrings to a set is not fast enough. You can solve this problem using suffix array: http://en.wikipedia.org/wiki/Suffix_array — deebee, Jun 29 '12 at 18:22

score 3 · Answer 1 · answered Jan 13 '13 at 02:49

@enjay's answer is correct. I'll elaborate for anyone who is new to this kind of string processing algorithm problems and want to learn more. My answer will depict the large picture and give pointers to any detail mentioned.

The problem @sachin pointed out in interviewstreet.com belongs to a large group of problem where substring, palindrome, etc is involved. All such problem can be solved by one dedicated data structure: suffix array(en.wikipedia.org/wiki/Suffix_array). The complete learning path is as below. But if you are only intreated in solving the problem, you can go directly to 3.

Trie(http://en.wikipedia.org/wiki/Trie). The foundation for suffix tree.
Suffix tree(http://en.wikipedia.org/wiki/Suffix_tree). Put all suffix of one string into a Trie. Observe that any substring of a given string is some prefix of one of the given string's suffix. The idea of suffix tree is inspiring, but since the construction of suffix tree is either too complex to implement or too slow, in practice, I doubt any one use it. However, for anyone want to challenge the unnecessary difficult, here the best illustration of construction algorithm for suffix tree:http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
Suffix array(http://en.wikipedia.org/wiki/Suffix_array). Suffix array contains any information suffix tree have(thus can do whatever suffix tree can do) and is way easier to implement. That being said, if you want to accomplish anything non-trivial with it, you need to construct at least three array and one index for RMQ. The three arrays are:

a. The suffix array itself.

b. The rank array.

c. The height array.

Since one common task when using suffix array is to find the longest common prefix of two suffix, one need to do RMQ query to the height array. RMQ is described here: http://en.wikipedia.org/wiki/Range_Minimum_Query

score 2 · Answer 2 · answered Jun 29 '12 at 15:48

You get bad_alloc error because there are too many unique substrings to fit in memory. To rectify it you may use any approach where there is no need to store every unique substring.

My solution is pretty complicated, so I provide just a sketch of an algorithm.

It is possible to store only those substrings, that are beginning at every possible position in w1, w2, ......, wn, and ending at the end of w1, w2, ......, wn. And instead of the whole substring, you can store only pointer to its starting character.

To answer the queries, you can sort all the queries, sort all the substrings. Then iterate the substrings in the following manner: take all the substrings starting from the same character, then take all the substrings with the same second character and so on. In other words, you implicitly construct a trie with all inner nodes having weight 1 and all leaf nodes having weight, equal to the length of the remaining unique substring, corresponding to this node. And you iterate through this trie, computing cumulative sum of each node weight, and comparing it with the next not-yet-processed query. As soon as you find the match, you print the substring and continue with trie traversal.

All this requires not much memory but is very hungry for computation resources. But it may be optimized. You may use an explicit trie to store all short substrings (probably with lengths 1 .. 2 or 1 .. 3). Also you may use bucket sort algorithm to sort longer substrings.

@gsingh2011: the substrings that end at every possible position are not stored anywhere. Otherwise there may be not enough memory to hold every substring. While we do not store every substring, we may count them as second half of this post explains. Sorted substrings, described here, are, in fact, a [Generalized suffix array](http://en.wikipedia.org/wiki/Suffix_array). Alternative approach is to use a [Generalized suffix tree](http://en.wikipedia.org/wiki/Generalised_suffix_tree) and count nodes in order of depth-first search, but it needs somewhat more memory. — Evgeny Kluev, Oct 30 '12 at 18:46

score 2 · Answer 3 · answered Aug 10 '12 at 14:11

use a suffix array...it will be faster and easier on memory than genarating all substrings... if u sort the suffix array and then try searching using some augmenting data structures like an lcp array and cumulative-substring-count array u can solve it in time and within memory constraints....

InterviewStreet Find Strings

3 Answers3