Counting English words in a random string

Question

Suppose I have a randomly generated string s=t&^%JHGgfdteam*&HGEdfg, what is the best approach to count the number of English words in that string? (English words as defined in some dictionary file). Obviously brute-force is not a good idea... would a suffix-tri e work? Binary search? Notice that in the case of s, there are two words: "tea" and "team". Any ideas? Regards

"ged" is also an English word. – John Rasch Sep 08 '10 at 03:21 — John Rasch, Sep 08 '10 at 03:21
"t&^%J" is also an English word. Oh, wait... – Dan Tao Sep 08 '10 at 03:23 — Dan Tao, Sep 08 '10 at 03:23
"Ed" is also an English word. – Leniel Maccaferri Sep 08 '10 at 03:24 — Leniel Maccaferri, Sep 08 '10 at 03:24

NullUserException · Accepted Answer · 2010-09-08T03:22:17.787

I would load the dictionary words in a Trie structure, then read the string from left to right and check if the substrings are in the trie. If they are and there are children, keep going. If they happen to be a leaf or a valid word, add to the occurence count.

In pseudo code:

Trie dict = ... // load dictionary
Dictionary occurences = {}

for i in length(string):
    j = i + 1
    # think of partial as string.Substring(i, j);
    while dict.hasChildren(partial):
        j++ 
        if isWord(partial):
            dict[partial]++

This way you'll guarantee it doesn't miss a match while still looking for all possibilities.

You can limit the minimum length of the valid words by changing what j is initialized to or by rejecting short words in the isWord() method (so a wouldn't be a "valid" word).

This should be more than enough to start with. Thanks! – Dervin Thunk Sep 08 '10 at 03:26 — Dervin Thunk, Sep 08 '10 at 03:26

score 6 · Answer 2 · edited Sep 08 '10 at 04:42

6

The Aho-Corasick string matching algorithm builds the matching structure in time linear in the size of the dictionary and matches patterns at time linear in the size of the input text + number of matches found.

edited Sep 08 '10 at 04:42

James McNellis

348,265
75
913
977

answered Sep 08 '10 at 04:39

mcdowella

19,301
2
19
25

+1: A trie is good, but a trie + a good search algorithm is far better. – James McNellis Sep 08 '10 at 04:42

Counting English words in a random string

2 Answers2