21

I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem:

You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w is a valid word, and has value 0 otherwise.

  1. Give a dynamic programming algorithm that determines whether the string s[*] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming that each call to dict takes unit time.
  2. In the event that the string is valid, make your algorithm output the corresponding sequence of words.
hippietrail
  • 15,848
  • 18
  • 99
  • 158
Pet
  • 211
  • 1
  • 2
  • 3
  • 2
    Well, it is an exercise from a textbook. I don't have the solutions to the exercises, and I am not sure how to solve this problem. – Pet Mar 15 '11 at 11:10
  • First thing that comes to mind - ambiguity. Suppose your dictionary has words 'was', 'her' and 'washer'. You can just prefer shortest words, though. Wait, no... you can chop part from word and render string invalid - like catch 'auto' from 'automatic'. – alxx Mar 15 '11 at 11:17
  • 3
    Did you try to search for answer first? There are already few questions about this problem on SO - http://stackoverflow.com/questions/4755157/split-string-into-words , http://stackoverflow.com/questions/3553958/tokenize-valid-words-from-a-long-string , http://stackoverflow.com/questions/3466972/how-to-split-a-string-into-words-ex-stringintowords-string-into-words . Some of them mention dynamic programming solutions. – hoha Mar 15 '11 at 11:38

6 Answers6

9

Let the length of your compacted document be N.

Let b(n) be a boolean: true if the document can be split into words starting from position n in the document.

b(N) is true (since the empty string can be split into 0 words). Given b(N), b(N - 1), ... b(N - k), you can construct b(N - k - 1) by considering all words that start at character N - k - 1. If there's any such word, w, with b(N - k - 1 + len(w)) set, then set b(N - k - 1) to true. If there's no such word, then set b(N - k - 1) to false.

Eventually, you compute b(0) which tells you if the entire document can be split into words.

In pseudo-code:

def try_to_split(doc):
  N = len(doc)
  b = [False] * (N + 1)
  b[N] = True
  for i in range(N - 1, -1, -1):
    for word starting at position i:
      if b[i + len(word)]:
        b[i] = True
        break
  return b

There's some tricks you can do to get 'word starting at position i' efficient, but you're asked for an O(N^2) algorithm, so you can just look up every string starting at i in the dictionary.

To generate the words, you can either modify the above algorithm to store the good words, or just generate it like this:

def generate_words(doc, b, idx=0):
  length = 1
  while true:
    assert b(idx)
    if idx == len(doc): return
    word = doc[idx: idx + length]
    if word in dictionary and b(idx + length):
       output(word)
       idx += length
       length = 1

Here b is the boolean array generated from the first part of the algorithm.

  • Isn't it inefficient if you consider all words that start at character N - k - 1. A better method would be `b[i] = true if there exists i <= j < N such that dict(s[i..j]) and b[j+1..N-1]` – Minh Pham Mar 07 '13 at 02:12
6

To formalize what @MinhPham suggested.

This is a dynammic programming solution.

Given a string str, let

b[i] = true if the substring str[0...i] (inclusive) can be split into valid words.

Prepend some starting character to str, say !, to represent the empty word. str = "!" + str

The base case is the empty string, so

b[0] = true.

For the iterative case:

b[j] = true if b[i] == true and str[i..j] is a word for all i < j

mingxiao
  • 1,712
  • 4
  • 21
  • 33
1

The O(N^2) Dp is clear but if you know the words of the dictionary, i think you can use some precomputations to get it even faster in O(N). Aho-Corasick

1

A dp solution in c++:

int main()
{
    set<string> dict;
    dict.insert("12");
    dict.insert("123");
    dict.insert("234");
    dict.insert("12345");
    dict.insert("456");
    dict.insert("1234");
    dict.insert("567");
    dict.insert("123342");
    dict.insert("42");
    dict.insert("245436564");
    dict.insert("12334");

    string str = "123456712334245436564";

    int size = str.size();
    vector<int> dp(size+1, -1);
    dp[0] = 0;
    vector<string > res(size+1);
    for(int i = 0; i < size; ++i)
    {
        if(dp[i] != -1)
        {
            for(int j = i+1; j <= size; ++j)
            {
                const int len = j-i;
                string substr = str.substr(i, len);
                if(dict.find(substr) != dict.end())
                {
                    string space = i?" ":"";
                    res[i+len] = res[i] + space + substr;
                    dp[i+len] = dp[i]+1;
                }
            }
        }
    }
    cout << *dp.rbegin() << endl;
    cout << *res.rbegin() << endl;

    return 0;
}
hidayat
  • 9,493
  • 13
  • 51
  • 66
0

Below is an O(n^2) solution for this problem.

void findstringvalid() {
string s = "itwasthebestoftimes";
set<string> dict;
dict.insert("it");
dict.insert("was");
dict.insert("the");
dict.insert("best");
dict.insert("of");
dict.insert("times");

vector<bool> b(s.size() + 1, false);
vector<int> spacepos(s.size(), -1);
//Initialization phase
b[0] = true; //String of size 0 is always a valid string
for (int i = 1; i <= s.size(); i++) {
    for (int j = 0; j <i; j++) {
       //string of size s[ j... i]
       if (!b[i]) {
           if (b[j]) {
              //check if string "j to i" is in dictionary
              string temp = s.substr(j, i - j);
              set<string>::iterator it = dict.find(temp);
              if (it != dict.end()) {
                  b[i] = true;
                  spacepos[i-1] = j;
              }
           }
        }
    }
}
if(b[s.size()])
    for (int i = 1; i < spacepos.size(); i++) {
        if (spacepos[i] != -1) {
            string temp = s.substr(spacepos[i], i - spacepos[i] + 1);
            cout << temp << " ";
    }
    }
}
  • 1
    Your dictionary doesn't include all possible words in the string. For example "a", "as" and "he" are all valid words that can be found in this substring. – Phil Glau May 17 '19 at 04:12
0

The string s[] can potentially be split into more than one ways. The method below finds the maximum number of words in which we can split s[]. Below is the sketch/pseudocode of the algorithm

bestScore[i] -> Stores the maximum number of words in which the first i characters can be split (it would be MINUS_INFINITY otherwise)

for (i = 1 to n){
     bestScore[i] = MINUS_INFINITY
     for (k = 1 to i-1){
        bestScore[i] = Max(bestSCore[i], bestScore[i-k]+ f(i,k))
     }
 }

Where f(i,k) is defined as:

f(i,k) = 1 : if s[i-k+1 to i] is in dictionary
       = MINUS_INFINITY : otherwise

bestScore[n] would store the maximum number of words in which s[] can be split (if the value is MINUS_INFINIY, s[] cannot be split)

Clearly the running time is O(n^2)

As this looks like a textbook exercise, I will not write the code to reconstruct the actual split positions.