Split a string to a string of valid words using Dynamic Programming

Question

I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem:

You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w is a valid word, and has value 0 otherwise.

Give a dynamic programming algorithm that determines whether the string s[*] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming that each call to dict takes unit time.
In the event that the string is valid, make your algorithm output the corresponding sequence of words.

Well, it is an exercise from a textbook. I don't have the solutions to the exercises, and I am not sure how to solve this problem. — Pet, Mar 15 '11 at 11:10
First thing that comes to mind - ambiguity. Suppose your dictionary has words 'was', 'her' and 'washer'. You can just prefer shortest words, though. Wait, no... you can chop part from word and render string invalid - like catch 'auto' from 'automatic'. — alxx, Mar 15 '11 at 11:17
Did you try to search for answer first? There are already few questions about this problem on SO - http://stackoverflow.com/questions/4755157/split-string-into-words , http://stackoverflow.com/questions/3553958/tokenize-valid-words-from-a-long-string , http://stackoverflow.com/questions/3466972/how-to-split-a-string-into-words-ex-stringintowords-string-into-words . Some of them mention dynamic programming solutions. — hoha, Mar 15 '11 at 11:38

score 9 · Answer 1 · answered Mar 15 '11 at 11:39

Let the length of your compacted document be N.

Let b(n) be a boolean: true if the document can be split into words starting from position n in the document.

b(N) is true (since the empty string can be split into 0 words). Given b(N), b(N - 1), ... b(N - k), you can construct b(N - k - 1) by considering all words that start at character N - k - 1. If there's any such word, w, with b(N - k - 1 + len(w)) set, then set b(N - k - 1) to true. If there's no such word, then set b(N - k - 1) to false.

Eventually, you compute b(0) which tells you if the entire document can be split into words.

In pseudo-code:

def try_to_split(doc):
  N = len(doc)
  b = [False] * (N + 1)
  b[N] = True
  for i in range(N - 1, -1, -1):
    for word starting at position i:
      if b[i + len(word)]:
        b[i] = True
        break
  return b

There's some tricks you can do to get 'word starting at position i' efficient, but you're asked for an O(N^2) algorithm, so you can just look up every string starting at i in the dictionary.

To generate the words, you can either modify the above algorithm to store the good words, or just generate it like this:

def generate_words(doc, b, idx=0):
  length = 1
  while true:
    assert b(idx)
    if idx == len(doc): return
    word = doc[idx: idx + length]
    if word in dictionary and b(idx + length):
       output(word)
       idx += length
       length = 1

Here b is the boolean array generated from the first part of the algorithm.

Isn't it inefficient if you consider all words that start at character N - k - 1. A better method would be `b[i] = true if there exists i <= j < N such that dict(s[i..j]) and b[j+1..N-1]` — Minh Pham, Mar 07 '13 at 02:12

score 6 · Answer 2 · answered Jul 03 '13 at 19:12

To formalize what @MinhPham suggested.

This is a dynammic programming solution.

Given a string str, let

b[i] = true if the substring str[0...i] (inclusive) can be split into valid words.

Prepend some starting character to str, say !, to represent the empty word. str = "!" + str

The base case is the empty string, so

b[0] = true.

For the iterative case:

b[j] = true if b[i] == true and str[i..j] is a word for all i < j

score 1 · Answer 3 · answered Mar 15 '11 at 11:56

1

The O(N^2) Dp is clear but if you know the words of the dictionary, i think you can use some precomputations to get it even faster in O(N). Aho-Corasick

answered Mar 15 '11 at 11:56

mariusgherman

29
6

hidayat · Answer 4 · 2011-03-15T13:25:32.057

A dp solution in c++:

int main()
{
    set<string> dict;
    dict.insert("12");
    dict.insert("123");
    dict.insert("234");
    dict.insert("12345");
    dict.insert("456");
    dict.insert("1234");
    dict.insert("567");
    dict.insert("123342");
    dict.insert("42");
    dict.insert("245436564");
    dict.insert("12334");

    string str = "123456712334245436564";

    int size = str.size();
    vector<int> dp(size+1, -1);
    dp[0] = 0;
    vector<string > res(size+1);
    for(int i = 0; i < size; ++i)
    {
        if(dp[i] != -1)
        {
            for(int j = i+1; j <= size; ++j)
            {
                const int len = j-i;
                string substr = str.substr(i, len);
                if(dict.find(substr) != dict.end())
                {
                    string space = i?" ":"";
                    res[i+len] = res[i] + space + substr;
                    dp[i+len] = dp[i]+1;
                }
            }
        }
    }
    cout << *dp.rbegin() << endl;
    cout << *res.rbegin() << endl;

    return 0;
}

Why don't you give the description what what you have done and explain why you have done so? — Minh Pham, Mar 07 '13 at 01:57
Just some random code doesn't help anyone. You should provide an explanation. — adijo, Jun 13 '14 at 06:05

score 0 · Answer 5 · answered Sep 27 '16 at 23:35

Below is an O(n^2) solution for this problem.

void findstringvalid() {
string s = "itwasthebestoftimes";
set<string> dict;
dict.insert("it");
dict.insert("was");
dict.insert("the");
dict.insert("best");
dict.insert("of");
dict.insert("times");

vector<bool> b(s.size() + 1, false);
vector<int> spacepos(s.size(), -1);
//Initialization phase
b[0] = true; //String of size 0 is always a valid string
for (int i = 1; i <= s.size(); i++) {
    for (int j = 0; j <i; j++) {
       //string of size s[ j... i]
       if (!b[i]) {
           if (b[j]) {
              //check if string "j to i" is in dictionary
              string temp = s.substr(j, i - j);
              set<string>::iterator it = dict.find(temp);
              if (it != dict.end()) {
                  b[i] = true;
                  spacepos[i-1] = j;
              }
           }
        }
    }
}
if(b[s.size()])
    for (int i = 1; i < spacepos.size(); i++) {
        if (spacepos[i] != -1) {
            string temp = s.substr(spacepos[i], i - spacepos[i] + 1);
            cout << temp << " ";
    }
    }
}

Your dictionary doesn't include all possible words in the string. For example "a", "as" and "he" are all valid words that can be found in this substring. — Phil Glau, May 17 '19 at 04:12

score 0 · Answer 6 · answered Mar 15 '11 at 11:35

The string s[] can potentially be split into more than one ways. The method below finds the maximum number of words in which we can split s[]. Below is the sketch/pseudocode of the algorithm

bestScore[i] -> Stores the maximum number of words in which the first i characters can be split (it would be MINUS_INFINITY otherwise)

for (i = 1 to n){
     bestScore[i] = MINUS_INFINITY
     for (k = 1 to i-1){
        bestScore[i] = Max(bestSCore[i], bestScore[i-k]+ f(i,k))
     }
 }

Where f(i,k) is defined as:

f(i,k) = 1 : if s[i-k+1 to i] is in dictionary
       = MINUS_INFINITY : otherwise

bestScore[n] would store the maximum number of words in which s[] can be split (if the value is MINUS_INFINIY, s[] cannot be split)

Clearly the running time is O(n^2)

As this looks like a textbook exercise, I will not write the code to reconstruct the actual split positions.

Split a string to a string of valid words using Dynamic Programming

6 Answers6

Linked