algorithm to parse string with dictionary

Question

Given

a dictionary full of words {in, july, den, dentist, best, ...} with some C++ API to access it: boolean findWord(string word), or string getNextWord(void) to iterate through it,
some input string with no space, e.g.: bestdentistinjuly...

Output

best dentist in july is... (basically separate the non-space string by space given a dictionary)

What will be the best algorithm to solve it?

A subtle but important question is, is there any fancy way to solve the unreachable dead-end problem. E.g., den and dentist are both valid words to dissect the rest of the string, one of them may just be a dead-end.

To me it seems like a greedy problem or something solvable by dynamic programming..

findWord() is the only implemented method? could you iterate through dictionary entities? — om-nom-nom, Jul 20 '11 at 18:07
Best algorithm to solve what? So far, we have a dictionary and a string. What are you trying to do? — Fantius, Jul 20 '11 at 18:09
Are you trying to split the string into known words (best+dentist+july in your example)? If so, there could be multiple solutions. — Fantius, Jul 20 '11 at 18:10
What should be the output of the algorithm? Whats the return value of findWord? — Christian Ammer, Jul 20 '11 at 18:11
Ok, so you have to decide how you would like to split the string "dental" given that all of these are words: den, dent, dental, al. Do you always want to take the longest word possible (greedy)? — Fantius, Jul 20 '11 at 18:22
Also, do you know at the start that the input string is composed only of actual words? — Fantius, Jul 20 '11 at 18:24
@Fantius, you can assume the input string is always valid; Greedy is one of the solution I guess. Not sure if there's something more efficient. — Figo, Jul 20 '11 at 18:30
It's not a question of what is more efficient, it's a question of the requirements. Would you want the output to be "dental" or "dent+al"? Or both? — Fantius, Jul 20 '11 at 18:34
@Fantius, I want all possible validly separated string out of it. "dent+al" will not be one. — Figo, Jul 20 '11 at 18:36
see also http://stackoverflow.com/questions/3983978/add-spaces-between-words-in-spaceless-string — AShelly, Jul 20 '11 at 19:47
@Figo: if you want all possible solutions, you should bear in mind that there could be exponentially (compared with the length of the string) many solutions. — Neil G, Jul 20 '11 at 21:52

score 2 · Answer 1 · edited May 23 '17 at 10:27

2

Use a Trie to store the dictionary. You can see a simple implementation (C#) at How to create a trie in c#

You're going to need to do a search because you don't know if you are on the right track until you have considered the whole input string. You'll need to iterate through the input string, at the same time as descending into the trie. When you get to a terminal node of the trie, you have a branch in your search process: you need to both treat that as the end of a word and treat it as the first part of a longer word.

edited May 23 '17 at 10:27

Community

1
1

answered Jul 20 '11 at 18:54

Fantius

3,806
5
32
49

The dictionary implementation isn’t relevant here, the interface for that is given anyway. – Konrad Rudolph Jul 20 '11 at 18:57
1

If that dictionary only allows iterating and calling findWord(), he may with to iterate through the whole thing and build his own trie since a trie would be a far more efficient way to solve the problem. It depends on the total length of all of the input strings that are expected to be processed during the lifetime of the dictionary. – Fantius Jul 20 '11 at 19:06
This is the correct answer. @Fantius: You should also make it clear that the branching is done breadth-first (all branches are iterated simultaneously) rather than depth-first. – Neil G Jul 20 '11 at 21:49
Thanks for my only upvote :) But I don't see the problem with doing it depth-first. OP wants all solutions. Depth-first is simpler and, in general, uses less memory (although the branching factor here will be small). – Fantius Jul 21 '11 at 12:39
@Fantius: You're right. In my mind, breadth-first would be better if you wanted to find one solution (if two partial solutions ever converge, one can be discarded, either randomly or according to some metric.) – Neil G Jul 21 '11 at 19:07
I'm sorry to be such a contrarian here but... If I wanted one solution and I didn't care which, I would use depth-first and stop searching when I found one. If I wanted to pick the best solution (by some metric) out of all of the solutions, I would use depth-first and as I reached a solution, use the metric to compare it to the previous solution, keeping only one. – Fantius Jul 22 '11 at 01:13
@Fantius: The DFS is exponential time (definitely true if you judge solutions using a metric), whereas the BFS with discarding is polynomial time (but if you have a metric, it has to be able to be applied to partial solutions.) – Neil G Jul 22 '11 at 18:41

Ricky Bobby · Answer 2 · 2011-07-20T19:22:15.143

You can create a kind of word tree:

You can go throught the string with no space. Once you find a word in your list, you add a space and you continu... until you cannot go further.

Then you go back to the previous word and try to se if adding new letter you can create a word, and if you can continu from their.

You try this until you tried all the possiblities.

If you go back to the starting word and you don't find any solution => no sol

Here is the algorithm ( my pseudocode syntax is not good, but you can get the general idea. I believe you will have to correct it a little):

TreeWordResult //Tree that keeps the results in memory

Go through the InputString:

      If you find a word in the InputDictionnary 
          Then add this word to the last node of the treeWordResult
      Otherwise:
          while (No word found):
                go back in the treeWordResult
                try to find word in InputDictionnary different from the one before (on the other node)
          endwhile
          if no word found:
                     return NO SOLUTION
          otherwise:
                     continue going through word
          endif
       endif
 return Leaf

Algorithm ends when you find no sol, or when your at a "leaf" (you went thhrough the whole string)

Here is an illustration using your example:

enter image description here

Hope my explaination is clear.

this algorithm may also match "best dent is tin...". You need another step to find the *best* match. Maybe a markov tree of representative text. — AShelly, Jul 20 '11 at 19:26
If (best,dent,is,tin) are words of the dictionnary it can match it with this solution too. tin is not in the dictionnary here so it won't do it. But if it was, you can continue the algorithm, starting again from the root to find all the possible solutions, and chose the best among these solution. You will eventually end with a markov tree of all the possible solutions. — Ricky Bobby, Jul 20 '11 at 19:38
@Neil G, you're right, I would need a function such as AShelly described to have a much more efficient search, and have my algorithm ends in a reasonnable time. Storing the dictionnary as a trie would help it a lot. — Ricky Bobby, Jul 21 '11 at 11:16

score 1 · Answer 3 · answered Jul 20 '11 at 19:04

1

This problem is basically like matching a regex of the form:

(in|july|den|dentis|best|...)*

So any regex algorithm may be used. Which you should choose depends on what will be the size of the dictionary and the length of the input. You should probably start here: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times

answered Jul 20 '11 at 19:04

Paweł Obrok

22,568
8
74
70

This will not give you all solutions, only the greedy solution. – Fantius Jul 20 '11 at 19:07
OP seems to have wanted the longest match – Paweł Obrok Jul 20 '11 at 19:09
In comments he states, "I want all possible validly separated string out of it." – Fantius Jul 20 '11 at 20:23

score 1 · Answer 4 · answered Jul 20 '11 at 19:40

I think you could make it faster if you could have the findWord method return different values for 'not-a-word' and 'no-words-starting-with-this-prefix'. This would be easy if the dictionary was stored as a trie.

The reason is that if you are checking words as in @Ricky Bobby's answer, after you find 'best', you still need to check 'bestd' and 'bestde' and so on all the way to the end of the string. However if the check for 'bestd' returns 'no-longer-words', then you have trimmed out a whole bunch of searching.

+1 You're right, the way i test things make the complexityof my algorithm huge . — Ricky Bobby, Jul 20 '11 at 19:45

score 0 · Answer 5 · answered Jul 20 '11 at 18:20

0

It is unclear from original post on what exactly to solve. I am guessing that @Figo is looking for something similar to a string matching algorithm.

A very good resource for this: http://igm.univ-mlv.fr/~lecroq/string/

answered Jul 20 '11 at 18:20

Bhaskar

2,549
1
21
23

The string matching isn’t the problem. The problem is getting the best *combination of matches* from a given position onwards. – Konrad Rudolph Jul 20 '11 at 18:37

Christian Ammer · Answer 6 · 2011-07-20T20:00:15.877

0

Maybe there are more than one valid solutions to separate the input string. You could use a backtracking algorithm to just find one or all valid solutions. On positions where two or more words e.g. "den", "dentist" are possible, the algorithm should try the longer words first.

The dictionary of course should be stored into a Trie to quickly find the matching words.

In the following ascii image the left branch would be examined first in a Depth-first search which prefers longer words. A solution would be found before the algorithm looks at the right branch with the word "den".


        Best
        /   \
   dentist  den
      /
     in
    /
  july

edited Jul 20 '11 at 20:00

answered Jul 20 '11 at 18:47

Christian Ammer

7,464
6
51
108

The backtracking algorithm could have exponential running time. Consider the string `aaaaaaaaa...` and the dictionary `{a, aa}`. Of course if all the solutions are to be found then there is no going around this. – Paweł Obrok Jul 20 '11 at 19:14
If you implement the backtracking as [Depth-first search](http://en.wikipedia.org/wiki/Depth-first_search) and finish after the first valid solution (in case the OP doesn't need all solutions) it's not exponential! – Christian Ammer Jul 20 '11 at 19:41
This is only true if you want any solution and not even the longest one - in that case the empty solution works just as well. – Paweł Obrok Jul 20 '11 at 20:42
@obrok: What do you mean with "the longest one", and why do you think OP wants to get this? – Christian Ammer Jul 21 '11 at 11:59
There are two possible cases: either matching only a substring is acceptable (then the longest soultion is the one that matches most characters) or substring matches are not valid solutions in which case a DFS is still exponential because choices made earlier may make you hit a dead end before the end of the string. – Paweł Obrok Jul 21 '11 at 12:27
@obrok: You start from conditions OP explicitly excluded, look at his comment: "...you can assume the input string is always valid...". Also DFS is not exponential in the worst case the performance is O(|V|+|E|) regarding Wikipedia :-) – Christian Ammer Jul 21 '11 at 19:40

algorithm to parse string with dictionary

6 Answers6

Linked