
I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English or not. I would have thought there'd be something online to do this; am I wrong?
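For concreteness, this rough, untested sketch is roughly what I had in mind for the dynamic-programming idea, using pyenchant's `Dict.check()` and assuming spaces and punctuation have already been stripped (the `max_word_len` cutoff is just to bound the search):

import enchant

def segment(s, max_word_len=20):
    # dp[i] is a list of words covering s[:i], or None if no split was found.
    d = enchant.Dict("en_US")
    dp = [None] * (len(s) + 1)
    dp[0] = []
    for i in range(1, len(s) + 1):
        # Try candidate last words s[j:i], longest first, up to max_word_len chars.
        for j in range(max(0, i - max_word_len), i):
            word = s[j:i]
            if dp[j] is not None and d.check(word):
                dp[i] = dp[j] + [word]
                break
    return dp[len(s)]

A real version would need some scoring, since `d.check()` happily accepts one-letter "words" like "a".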

Erotemic
  • You could encode your dictionary of words as a trie and use a greedy algorithm: pull the longest word that matches and then go on to the next word, backtracking on failure. Probably not optimal. Try this for recommendations on data structures: http://kmike.ru/python-data-structures/ – hughdbrown Mar 12 '13 at 15:14
  • Interesting question. I'd guess the answer ("easy way") is going to be "no". – Tim Pietzcker Mar 12 '13 at 15:15
  • Similar question asked before that didn't have much luck: http://stackoverflow.com/questions/13034330/how-to-separate-an-engilsh-language-string-without-spaces-to-form-some-meaningfu – Jon Clements Mar 12 '13 at 15:15
  • For example, how would your algorithm know that it's not `be roughly divide din to`? They are all correct English words... – Tim Pietzcker Mar 12 '13 at 15:21
  • @Tim Pietzcker: Because that is not the greedy approach. "Greed, for lack of a better word, is good. Greed is right. Greed works." http://en.wikipedia.org/wiki/Gordon_Gekko#.22Greed_is_Good.22_quotation – hughdbrown Mar 12 '13 at 15:34
  • see here: http://stackoverflow.com/questions/3466972/how-to-split-a-string-into-words-ex-stringintowords-string-into-words – Vsevolod Dyomkin Mar 12 '13 at 15:35
  • You may refer to this solution http://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words – kiriloff Sep 02 '14 at 06:38

2 Answers


Greedy approach using trie

Try this using Biopython (pip install biopython):

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Load the system word list into a Bio.trie, one entry per word.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Keep ASCII-only entries; encoding a non-ASCII line raises
                # UnicodeDecodeError, which we simply skip.
                word = word.encode('ascii', 'ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Greedy longest match: try the whole string first, then shorter prefixes.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            break  # no dictionary prefix matches; give up on the rest
        print word


if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Results

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

Caveats

There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those, but this should get you started.
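For the record, a backtracking variant over the same `tr` could look something like this (untested sketch):

def segment_backtrack(tr, s):
    # Return a list of words covering s, or None if no segmentation exists.
    if not s:
        return []
    # Longest prefix first, falling back to shorter ones when a branch fails.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            rest = segment_backtrack(tr, s[end + 1:])
            if rest is not None:
                return [word] + rest
    return None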

Obligatory test

>>> main("expertsexchange")
experts
exchange
hughdbrown

This is the sort of problem that occurs often in Asian NLP. If you have a dictionary, then you can use this: http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).

Note that the search space might be extremely large, because a word in alphabetic English is surely longer (in characters) than a syllable in Chinese/Japanese.

alvas