10

Suppose you have a dictionary that contains valid words.

Given an input string with all spaces removed, determine whether the string is composed of valid words or not.

You can assume the dictionary is a hashtable that provides O(1) lookup.

Some examples:

helloworld -> hello world (valid)
isitniceinhere -> is it nice in here (valid)
zxyy -> invalid

If a string has multiple possible parsings, simply returning true is sufficient.

The string can be very long. Hence, think of an algorithm that is both space- and time-efficient.

SiLent SoNG
  • This can be done in quadratic time by dynamic programming, see [here](http://stackoverflow.com/questions/3466972/how-to-split-a-string-into-words-ex-stringintowords-string-into-words/3469228#3469228). – Falk Hüffner Aug 24 '10 at 09:16
  • This is optimal for the dictionary data structure given as we could have substrings [0..i] for i=1..N/2-1 match, as well as substrings [i..N-1] for i=N/2..N-2 match. – Nabb Aug 24 '10 at 09:24
  • Ah, interesting. I assumed that there might be a way to attain better worst-case bounds than my straightforward recursion approach by avoiding recomputations somehow (given the existence of CYK and all), but couldn't quite figure out how to do it. – ig2r Aug 24 '10 at 09:33
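
To make the quadratic dynamic-programming approach from the first comment concrete, here is a minimal sketch (illustrative code, not from the thread; `dictionary` stands for the O(1)-lookup hashtable from the problem statement):

```python
def can_segment(s, dictionary):
    """True iff s is a concatenation of dictionary words.

    ok[i] says whether the prefix s[:i] can be segmented;
    O(n^2) dictionary lookups, O(n) extra space.
    """
    n = len(s)
    ok = [False] * (n + 1)
    ok[0] = True  # the empty prefix is trivially valid
    for i in range(1, n + 1):
        ok[i] = any(ok[j] and s[j:i] in dictionary for j in range(i))
    return ok[n]

print(can_segment("helloworld", {"hello", "world"}))  # True
print(can_segment("zxyy", {"hello", "world"}))        # False
```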

2 Answers

6

I think the set of all strings that occur as concatenations of valid words (words taken from a finite dictionary) forms a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).

For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input symbol, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can make do with O(number of words) using the construction hinted at above.
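
A minimal sketch of how such an automaton could be built (Python; `build_automaton` and its state numbering are illustrative, not the answer's code):

```python
def build_automaton(words):
    """Build a nondeterministic automaton as {(state, char): set of states}.

    State 0 is both the start state and the only accepting state;
    completing a word always leads back to state 0, mirroring the
    {bat, bag} construction above.
    """
    delta = {}
    next_state = 1
    for word in words:
        state = 0
        for i, ch in enumerate(word):
            targets = delta.setdefault((state, ch), set())
            if i == len(word) - 1:
                targets.add(0)  # word complete: edge back to start
            else:
                # reuse an existing interior successor to share prefixes
                succ = next((t for t in sorted(targets) if t != 0), None)
                if succ is None:
                    succ = next_state
                    next_state += 1
                    targets.add(succ)
                state = succ
    return delta

# build_automaton(["bat", "bag"]) yields exactly the edges above:
# {(0,'b'): {1}, (1,'a'): {2}, (2,'t'): {0}, (2,'g'): {0}}
```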

For another example, with the words {bag, bat, bun, but} the automaton would look like this: [figure: automaton diagram]

Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string. More formally, our algorithm is as follows:

  1. Let S be a set of states, initially containing the starting state.
  2. Read the next input character, let us denote it by a.
  3. For each element s in S, determine the state that we move into from s on reading a; that is, the state r such that with the notation above (s,r,a) is an edge. Let us denote the set of these states by R. That is, R = {r | s in S, (s,r,a) is an edge}.
  4. (If R is empty, the string is not accepted and the algorithm halts.)
  5. If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted, if not, the string is not accepted.
  6. Otherwise, take S := R and go to 2.
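
A hedged Python sketch of this loop, reusing the hypothetical `build_automaton` from the sketch above:

```python
def accepts(delta, text):
    """Subset simulation of steps 1-6; O(len(text)) iterations
    for a fixed automaton."""
    states = {0}                                   # step 1: start state only
    for ch in text:                                # step 2: read next symbol
        # step 3: all states reachable from any current state on ch
        states = set().union(*(delta.get((s, ch), set()) for s in states))
        if not states:                             # step 4: stuck, reject early
            return False
    return 0 in states                             # step 5: accepting state?

delta = build_automaton(["bat", "bag"])
print(accepts(delta, "batbag"))  # True
print(accepts(delta, "batba"))   # False
```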

Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the size of S and R is not greater than the number of states in the automaton, which is constant and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose multiple 'parsings', but that was not a requirement either.) I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.

sandris
  • Say we have n valid words. How many possible automata can there be by chaining words up (either different or the same)? I guess it's O(n!). The space complexity is unacceptable. – SiLent SoNG Aug 24 '10 at 11:13
  • There's no need to 'chain up words', I'll try to edit my answer :-) – sandris Aug 24 '10 at 12:22
  • chain of chars. nice. Can you justify the O(n) time cost? To find a complete path in the graph in O(n). – SiLent SoNG Aug 24 '10 at 15:06
  • @SiLent SoNG: I added some explanation to my answer. – sandris Aug 24 '10 at 16:25
  • This answer is very wrong, and misleading. It is equivalent to a greedy algorithm, which cannot solve the problem. For example, if your dictionary has the words "the", "there", and "is", then you must accept both strings "theis" and "thereis", but there is no way that your solution can. – behdad Dec 18 '14 at 03:21
  • If your dictionary is "the", "there", and "is", the automaton built in my answer *does* accept "theis" and "thereis". I'm not sure what you mean by "equivalent to a greedy algorithm". – sandris Dec 18 '14 at 13:47
  • @behdad: Have you actually tried it out with pen and paper? – sandris Dec 19 '14 at 09:21
  • @sandris Ok I read your algorithm again. It does work indeed, since you are keeping a set of states. Previously I assumed you are keeping one state. But then the problem is that your run-time analysis is incorrect. You assume that processing each input character is constant-time, whereas the size of your state set S can in fact be exponential in the number of symbols read so far, but capped by the total number of states in your machine. – behdad Dec 19 '14 at 21:51
  • Well, that's the whole point. S is a subset of the state set of the automaton, hence the size of S is never greater than the number of states in the automaton (which is constant in the number of input symbols read). When the input string is fully read, S will tell you whether the input string is accepted or not. It won't provide you with a parsed form. – sandris Dec 20 '14 at 20:06
  • What do you mean "constant in the number of input symbols read"?! Your algorithm cannot be linear time, that's what I'm trying to say. WHat am I missing? – behdad Dec 21 '14 at 19:53
  • It is linear time, as proven in the answer, where's the error? (The size of S is not greater than the number of states in the automaton, and the number of states in the automation does not depend on the length of the input string.) The formal language in question is regular, and the membership problem for regular languages is solvable in linear time. I'm not sure what you are missing. – sandris Dec 22 '14 at 09:56
  • Fine, you win the argument. But your algorithm has a gigantic constant, which depends on the size of the dictionary. In other words, if the length of the input string is n and the number of words in the dictionary is N, then your algorithm is not O(n). It's a horrible approach to this simple problem. The problem can be solved in O(n^2) with no dependence on N, and that's definitely much much much much faster than your solution in the worst case. – behdad Jan 04 '15 at 19:03
0

I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string and result either true or false, depending on whether the entire string can be tokenized correctly.

Works like this:

  1. If input is the empty string, return true.
  2. Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
  3. If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
  4. Rinse and repeat.
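
A minimal Python sketch of these steps (function and variable names are illustrative):

```python
def f(s, dictionary):
    """True iff s tokenizes into dictionary words (recursive backtracking)."""
    if s == "":                        # step 1: empty string is valid
        return True
    for end in range(1, len(s) + 1):   # steps 2-3: grow the prefix by one
        if s[:end] in dictionary and f(s[end:], dictionary):
            return True                # prefix is a word and the suffix tokenizes
    return False                       # no prefix worked: backtrack

print(f("isitniceinhere", {"is", "it", "nice", "in", "here"}))  # True
print(f("zxyy", {"is", "it", "nice", "in", "here"}))            # False
```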

For dictionaries with a low to moderate number of ambiguous prefixes, this should fetch a pretty good running time in practice (O(n) in the average case, I'd say), though in theory, pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better since we need backtracking anyway, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.

EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.

Space complexity would be only stack space, so O(n) even in the worst case.

ig2r
  • Can you clarify on the claim "average case O(n)"? – SiLent SoNG Aug 24 '10 at 06:59
  • Hm, come to think about it, that average O(n) may have been a misjudgment on my part. For instance, if the algorithm sees the prefix 'a' in the dictionary (assuming a dictionary of English words for a minute), it would first try to tokenize the remaining input... likely adding the *entire* suffix character-by-character before deciding that this cannot be tokenized and reverting to expanding the 'a'. Looks more like something polynomial, then. Considering Cocke-Younger-Kasami has something like O(n^3), too... good question, then. – ig2r Aug 24 '10 at 07:36
  • In worst case, the algorithm will attempt each possible text span to justify whether it is a valid word or not. There are n^2 possible text spans, hence worst case O(n^2). – SiLent SoNG Aug 24 '10 at 08:07
  • The worst case is O(2^N), as we have T(0) = O(1), T(N) = sum(T[i], 0..N-1). – Nabb Aug 24 '10 at 08:52
  • @Nabb: A careful implementation will not go exponential. The algorithm keeps asking, at step 2: does the remaining substring satisfy the property? If we have previously attempted this substring, don't compute it again. There are n^2 substrings in total, hence worst case O(n^2). – SiLent SoNG Aug 24 '10 at 11:10
  • @SiLent SoNG: "If previously we have attempted this substring, don't compute it again." - this solution is not doing this. You can add this by keeping a cache of previously attempted substrings. This is basically memoization and is actually the same as the Dynamic Programming solution suggested by @Falk Hüffner. Only a DP implementation would do away with the recursion. – MAK Aug 24 '10 at 11:29
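
To make the memoization suggested in these last comments concrete, a hedged sketch: caching results by the suffix's start index means each of the n suffixes is evaluated at most once, giving O(n^2) dictionary lookups overall (names are illustrative):

```python
from functools import lru_cache

def f_memo(s, dictionary):
    """Backtracking with memoization on the suffix start index."""
    @lru_cache(maxsize=None)
    def from_index(i):
        if i == len(s):                # empty suffix: trivially valid
            return True
        # try every prefix s[i:j]; each index i is computed at most once
        return any(s[i:j] in dictionary and from_index(j)
                   for j in range(i + 1, len(s) + 1))
    return from_index(0)

print(f_memo("isitniceinhere", {"is", "it", "nice", "in", "here"}))  # True
```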