method to find the shortest substring containing the given words:optimization required

Question

I have a program that requires me to find the shortest subsegment of a given String, containing a list of words. Even though my program is correct, I am failing to deliver within a time frame of execution(5 seconds). I figured the problem is because of the complex(trivial) algorithm I am using. It consists of nested loops, and requires multiple scan of the list_of_words array.Here is my code for the search function. a[] contains the original string,stored by words, b[] contains the list of the words which is to be found to form the subsegment. String gstores the temporary subsegment formed by words from original string including the words in the list.

private static void search() // Function to find the subsegment with a required list of words
{
   int tail,x;//counters 
   String c[]=new String[b.length]; //initializing a temporary array to copy the list of words array.

   for(int i =0; i<a.length;i++)// looping throw original string array
    {
       System.arraycopy(b, 0, c, 0, b.length);//copying the list of words array to the temporary array

        for (int j=0;j<b.length;j++)//looping throw the temporary array
        { 
            x=0; //counter for temporary array

            if(a[i].equalsIgnoreCase(b[j]))//checking for a match with list of words
            {
                tail=i;
//adds the words starting from the position of the first found word(tail) till all the words from the list are found
                while((x<b.length)&&(tail<a.length))

                {
                    g=g+" "+a[tail];//adds the words in the string g

                    for(int k=0;k<c.length;k++) //checks for the found word from the temporary array and replaces it with ""    
                    {
                        if(c[k].equalsIgnoreCase(a[tail]))
                        {
                            c[k]=""; x++;
                        }
                    }
                    tail++;

                }
                if((tail==a.length)&&(x!=b.length))//checks if the string g does not contains all the listed word
                {
                    g="";
                }
                else
                    {
                    count(g);g="";//check for the shortest string.
                    }
            }
        }

    }print();
}

Sample :

Original String :This is a test. This is a programming test. a programming test this is.

Words to be found : this , test, a ,programming.

Subsegments :

This is a test This is a programming

This is a programming test

a programming test a programming test this

programming test a programming test this

test a programming test this

a programming test this

Shortest Sub segment : a programming test this

Any suggestion regarding change in the data structures or looping structures or even changes in the algorithm that optimizes the same will be helpful.

A little confused on the problem you are trying to solve, can you explain with an example perhaps. Also a regex may help you — aaronman, May 03 '13 at 22:02
Are the words need to be in order? The array with the words is by data input or you split strings into words arrays? — ssedano, May 03 '13 at 22:03
@ssedano yes, in the order they appear in the original string. — Sushim Mukul Dutta, May 03 '13 at 22:06
Are you saying the point is to find the shortest subsequence in a string that includes all the words in the list given — aaronman, May 03 '13 at 22:16
@aaronman yes. The words need to print the shortest subsegment which contains all the words from the list. — Sushim Mukul Dutta, May 03 '13 at 22:21
This is a tough problem to do fast but it should have a better than brute force approach, btw I'm pretty sure all the current given answers are wrong — aaronman, May 03 '13 at 22:37
There are some problems i noticed, for example the word "programming" contains the word "a", how is the string supposed to be separated, do they have to be separated by some sort of punctuation. If you were allowed to put multiple punctuations in between words it could really change the algorithm. — aaronman, May 03 '13 at 22:42
@aaronman I mentioned words, not characters. Words in a string are separated by spaces I presume. :) — Sushim Mukul Dutta, May 03 '13 at 22:47
@aaronman just for information, I have already separated each word of the original string and stored in `a[]` using regex(also removed any special characters: that was one of the requirements of the question). I need help with the algorithm to find the subsegment, which is not flawed but also not very efficient( more then 5 secs in some cases :( ). — Sushim Mukul Dutta, May 03 '13 at 23:19
Looks like someone is brainstorming for their Amazon Entrance test: https://amazon.interviewstreet.com/challenges/dashboard/#problem/4fd648244715d — WernerCD, Nov 11 '13 at 14:24

Bernhard Barker · Accepted Answer · 2013-05-04T01:38:21.090

Dynamic Programming solution:

Have a last position variable for each of the words you're looking for.

Have a total count of distinct seen words you're looking for (will never decrease, max = count of words you're looking for).

For each word position in the input:

If the word exists in the list of words you're looking for, update the last position of that word.
Increase the total count if the updated last position was not initialized.
If the total count is equal to the max, loop through the last positions and find the smallest one. The distance between the current position and that value will be the length of the substring. Record these values and find the best one over all positions.

An optimization is to have a heap for last positions to reduce the time taken to find the smallest one (should be used together with some structure (possibly a hash- or tree-map) that allows fast lookup of pointers into the heap given a word).

Example:

Input: This is a test. This is a programming test. a programming test this is

Looking for: this, test, a, programming

                1    2  3  4     5    6  7  8           9     10 11          12   13   14
                This is a  test. This is a  programming test. a  programming test this is
this         -1 1    1  1  1     5    5  5  5           5     5  5           5    13   13
test         -1 -1   -1 -1 4     4    4  4  4           9     9  9           12   12   12
a            -1 -1   -1 3  3     3    3  7  7           7     10 10          10   10   10
programming  -1 -1   -1 -1 -1    -1   -1 -1 8           8     8  11          11   11   11
Count        0  1    1  2  3     3    3  3  4           4     4  4           4    4    4
Substr len   NA NA   NA NA NA    NA   NA NA 5           5     6  7           8    4    5
Shortest len NA NA   NA NA NA    NA   NA NA 5           5     5  5           5    4    4

Best result: a programming test this, length = 4.

Complexity Analysis:

Let n be the number of words in the input and k the number of words we're looking for.

The algorithm only does one pass through the input and, at each step, does O(log k) work for the getMin operation (with the heap optimization).

Thus the total time taken is O(n log k).

Dealing with duplicates:

If duplicates are allowed in the words we're looking for (and the target sequence must match all occurrences), the algorithm above won't work as is, but a simple fix is to have each distinct word have its own heap of pointers into the original heap (the value in this heap being the same as the value in the original heap), with the maximum size of this heap being the number of occurrences of that word in the words we're looking for.

Thanks a lot. This algorithm uses less then half the number of scan which the my naive solution required. Drastically reduces the execution time. — Sushim Mukul Dutta, May 04 '13 at 15:27

score 4 · Answer 2 · answered May 03 '13 at 22:47

Here's the implementation that occurs to me.

//Implementing here with two List<String>
//Should be easy enough to use arrays, or streams, or whatever.
public static int getShortestSubseqWith(List<String> text, List<String> words) {
    int minDistance = Integer.MAX_VALUE;
    //Create a map of the last known position of each word
    Map<String, Integer> map = new HashMap();
    for (String word : words) {
        map.put(word, -1);
    }
    String word;
    //One loop through the main search string
    for (int position = 0; position < text.size(); position++){
        word = text.get(position);
        //If the current word found is in the list we're looking for
        if (map.containsKey(word)) {
            //Update the map
            map.put(word, position);
            //And if the current positions are the closest seen so far, update the min value.
            int curDistance = getCurDistance(map);
            if (curDistance < minDistance)
                minDistance = curDistance;
        }
    }
    return minDistance;
}

//Get the current distance between the last known position of each value in the map
private static int getCurDistance(Map<String, Integer> map) {
    int min = Integer.MAX_VALUE;
    int max = 0;
    for (Integer value : map.values()) {
        if (value == -1)
            return Integer.MAX_VALUE;
        else {
            max = Math.max(max,value);
            min = Math.min(min,value);
        }
    }
    return max - min;
}

The main performance influence here, if hits are relatively sparse, and list of terms to search for relatively small, should just be the loop over the text to be searched. If hits are very frequent, performance may suffer, due to more frequent runs through getCurDistance.

My function fails(times out) because of the large number of the subsegments that are generated. Still I think,using list and the tree traversal should reduce the time instead of my poor array scanning, in which case should solve my problem. Thanks a lot :). Will mark the answer correct after checking the solution. ;) — Sushim Mukul Dutta, May 03 '13 at 23:28
This code does not fully answer the question. It returns the minimum distance found, but cannot print the actual matched sequence since *map* might get updated after containing the shortest sequence. — Barry NL, May 03 '13 at 23:34

Barry NL · Answer 3 · 2013-05-03T22:55:36.557

Another approach might be to map each word in b[] to its occurrence indices in a[].

Map<Integer, List<Integer>> occurrence = new HashMap<Integer, List<Integer>>();
for(int idx = 0; idx < a.length; idx++)
{
  int bIdx = ... retrieve the index of the word a[idx] in b or -1 if it doesn't exist;

  if(bIdx >= 0)
  {
    List<Integer> bIdxOccurs = occurrence.get(bIdx);
    //some code to initially create the lists
    bIdxOccurs.add(idx);
  }
}

And then find a combination of occurrences from each word in the map whose indices are closest to each other. The naive way would be to generate every combination and compare the distance between the smallest and biggest index, but there might be a faster way. I have to think about that...

Finally, take every word from a[] that lies between the found smallest and biggest index of the shortest sequence.

nullptr · Answer 4 · 2013-05-03T22:44:50.037

1

String[] a; // larger string
String[] b; // list of words to search

int index = -1;

for (int i = 0; i < a.length - b.length; i++)
{
    HashSet<String> set = new HashSet<String>(b.length);
    for (String s : b)
        set.add(s);

    boolean found = true;

    for (int j = 0; j < b.length; j++)
    {
        if (set.contains(a[i+j]))
            set.remove(a[i+j]);
        else
        {
            found = false;
            break;
        }
    }
    if (found)
    {
        index = i;
        break;
    }
}

If you can stand having multiple instances of a given word, it gets easier. This assumes that each word in b is unique.

edited May 03 '13 at 22:44

answered May 03 '13 at 22:02

nullptr

2,244
1
15
22

the words are in any order. not necessarily in sequence. I have given the sample input in the comments if you may check once. – Sushim Mukul Dutta May 03 '13 at 22:16
2

This is not an answer to the question. As I understand it, non-matching words are allowed between matching ones. This implementation would never find 'match match non-match match match' although it might be the shortest sequence. – Barry NL May 03 '13 at 23:22
This is not a solution to the question asked by the person. – Akshay Aug 21 '15 at 02:34

score 1 · Answer 5 · edited May 23 '17 at 11:59

I could see this problem as an alternative of minimum window width problem. Instead of characters, it is words here.

Its almost same as solution given by Dukeling. The only add-on is the use of LinkedHashMap for tracking the words found in the order. A java solution can be found here.

Here is my python implementation


import collections
def minsubstring(sentence, words):
    sentence = sentence.split(' ')
    mintillnow = sentence
    words = set(words.split(' '))
    found = collections.defaultdict(lambda : [-1,-1])#position of word in the sentence and order of the word
    linked = [] # this together with 'found' provides the functionality of LinkedHashMap
    for i, word in enumerate(sentence):
        if word in words:
            found[word][0] = i
            if found[word][1] != -1:#if this word is already seen, remove it from linked list
                del(linked[found[word][1]])
            linked.append(word)#append the newly found word to the tail of the linked list
            # probably the worst part in this code, updating the indexes back to the map
            for i, wword in enumerate(linked):
                found[wword][1] = i
            # if found all the words, then check if the substring is smaller than the one till now and update
            if len(linked) == len(words):
                startPos = found[linked[0]][0]
                endPos = found[linked[-1]][0]
                if (endPos - startPos + 1) &lt len(mintillnow):
                    mintillnow = sentence[startPos:endPos + 1]
    return ' '.join(mintillnow)

Test result


>>> minsubstring('This is a test. This is a programming test. a programming test this is. ','this test a programming')
'a programming test this'

score 0 · Answer 6 · answered May 03 '13 at 23:21

0

I think you may be able to do it by having a head and a tail pointer keep moving one inwards until you no longer have a match then do the same for the other and repeat the whole process until it won't go inwards anymore. I may try to code it later.

answered May 03 '13 at 23:21

aaronman

18,343
7
63
78

did the same :P, that was my algorithm if you refer the code. The problem is for some test cases the scan of list_of_words(`b[]`) is taking lot of time. – Sushim Mukul Dutta May 03 '13 at 23:23
Hmm, this alg shoulndt be that slow – aaronman May 03 '13 at 23:30
@SushimMukulDutta, if scanning the list of your target words is taking too long, you should put them in a query structure (such as a trie). – comingstorm May 04 '13 at 00:03
@comingstorm would a Hashtable and Set will produce a similar result? as using a trie? – Sushim Mukul Dutta May 04 '13 at 00:12
They should help; give them a try. However, tries are specifically designed for string lookup, while `Hashtable` and `Set<>` are more generic. – comingstorm May 04 '13 at 00:33

score 0 · Answer 7 · answered May 04 '13 at 00:57

I'll take a stab at outlining a more efficient algorithm.

Don't concatenate the string. Instead count characters as you add, i.e. length() + 1 for each word.

For the sublist save starting word, ending word, character count.

When a shorter list is found, replace the above values.

Write a method to find the first sublist starting with a specific element and returning the above definitions for the sublist (start, end, char count).

Invoke above method using the first word. Save values. Invoke method using starting word+ 1. Rinse and repeat saving shorter values when found.

You can even improve on that by using the fact that the first word in the sublist has to be one of your search words. Starting at start + 1 you can simply look for that element rather than all elements since it is the only one missing (still need to use all to find first matching word). If you find it prior to the ending word in the sublist you have a smaller sublist. If you find it after the ending word it is the new ending.

That's a lot more complicated but potentially a lot faster. A common tradeoff.

score 0 · Answer 8 · answered Apr 29 '14 at 07:39

public final class MaxStringWindow {

    private MaxStringWindow() {}

    private static void addStringCount(Map<String, Integer> map, String str) {
        if (!map.containsKey(str)) {
            map.put(str, 1);
        } else {
            int val = map.get(str);
            map.put(str, val + 1);
        }
    }

    private static Map<String, Integer> toFindMap(List<String> strList) {
        final Map<String, Integer> toFind  = new HashMap<String, Integer>();
        for (String stri : strList) {
            addStringCount(toFind, stri);
        }
        return toFind;
    }


    public static int minWindowSize(String sentence, List<String> strList) {
        final Map<String, Integer> toFind = toFindMap(strList);
        final Map<String, Integer> hasFound  = new HashMap<String, Integer>();

        int matchCtr = 0;
        boolean matchFound = false;
        String currLeftMostString = null;

        int j = 0; // the trailing position of the sliding window
        int i = 0; // the leading position of the sliding window.

        int min = Integer.MAX_VALUE;

        String[] words = sentence.split(" "); 

        for (i = 0; i < words.length; i++) {

            if (!toFind.containsKey(words[i])) {
                continue;
            }

            if (!matchFound) {
                currLeftMostString = words[i];
                matchFound = true;
                j = i;  
            }

            addStringCount(hasFound, words[i]);

            matchCtr++;

            // check if match has been completed.
            if (matchCtr >= strList.size()) {
                if ((i - j + 1) < min) {
                    min = i - j + 1;
                }
            }

            // does the first element exceed value ?
            if (hasFound.get(currLeftMostString) > toFind.get(currLeftMostString)) {
                // advance the left pointer, such the window (i-j) is as small as possible.    
                while (!toFind.containsKey(words[j]) || hasFound.get(words[j]) > toFind.get(words[j])) {
                    if (hasFound.containsKey(words[j])) {
                        int val = hasFound.get(words[j]);
                        hasFound.put(words[j], val - 1);
                    } 
                    j++;
                }
                currLeftMostString = words[j];
            }   
        }


        if (matchCtr < strList.size()) {
            throw new IllegalArgumentException("The subset is not found in the input string.");
        }

        // note: here we dont do (i-j+1) since i has been incremented additionally in a for loop.
        return min > (i - j) ? i - j : min;
    }

}

method to find the shortest substring containing the given words:optimization required

8 Answers8