7

I'm trying to create a simple parser for some text data. (The text is in a language that NLTK doesn't have any parsers for.)

Basically, I have a limited number of prefixes, which can be either one or two letters; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever's in between them should be the "root" of the word. Many words will have more the one possible parsing, so I want to input a word and get back a list of possible parses in the form of a tuple (prefix,root,suffix).

I can't figure out how to structure the code though. I pasted an example of one way I tried (using some dummy English data to make it more understandable), but it's clearly not right. For one thing it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work with words that have more than one prefix or suffix, or both prefix(es) and suffix(es).

Any thoughts?

prefixes = ['de','con']
suffixes = ['er','s']

def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses



>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...   parses = parser(w)
...   print w
...   for p in parses:
...     print p
... 
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')
larapsodia
  • 594
  • 4
  • 15

3 Answers3

4

Pyparsing wraps the string indexing and token extracting into its own parsing framework, and allows you to use simple arithmetic syntax to build up your parsing definitions:

wordlist = ['construct','destructer','constructs','deconstructs']

from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo

endOfString = StringEnd()
prefix = oneOf("de con")
suffix = oneOf("er s") + FollowedBy(endOfString)

word = (ZeroOrMore(prefix)("prefixes") + 
        SkipTo(suffix | endOfString)("root") + 
        Optional(suffix)("suffix"))

for wd in wordlist:
    print wd
    res = word.parseString(wd)
    print res.dump()
    print res.prefixes
    print res.root
    print res.suffix
    print

The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output from this program is:

construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct


destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']

constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']

deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
PaulMcG
  • 62,419
  • 16
  • 94
  • 130
2

Here is my solution:

prefixes = ['de','con']
suffixes = ['er','s']

def parse(word):
    prefix = ''
    suffix = ''

    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True

    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True

    return (prefix, word, suffix)

print parse('construct')
print parse ('destructer')
print parse('deconstructs')
print parse('deconstructers')
print parse('deconstructser')
print parse('condestructser')

Result:

>>> 
('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')

The idea is to loop through all prefixes and aggregate them, and at the same time remove them from the word. The tricky part is that the order in which the prefixes are defined may hide prefixes from being found, so the iterations must be re-invoked until all prefixes are found.

The same goes for suffixes, except that we build the suffix word in reverse order.

Israel Unterman
  • 13,158
  • 4
  • 28
  • 35
  • You used the flag `found` to break out of the outer `while` loop. I would prefer using the fist recipe from [here](http://stackoverflow.com/questions/6920206/sending-stopiteration-to-for-loop-from-outside-of-the-interator): for-else-continue-break construct. Still, it's disputable and a matter of taste. What do you think? – ovgolovin Apr 14 '12 at 19:51
  • I'm with you in this case. Since this a language construct and idiom, it makes the code esier to understand. The minute you glance at the `else` of the `for` statement, you understand what the meaning is behind the code. The way I coded it requires more effort to grasp. I should make more use of this scheme myself! :-) – Israel Unterman Apr 14 '12 at 20:37
  • I think the other one will work better for my purposes, but thank you very much! – larapsodia Apr 14 '12 at 21:06
2

CodeChords man beat me to this one, but as mine gives the prefixes and suffixes as tuples (which may be more or less useful given the context), and uses recursion, I thought I'd post it anyway.

class Parser():
    PREFIXES = ['de', 'con']
    SUFFIXES = ['er', 's']
    MINUMUM_STEM_LENGTH = 3

    @classmethod
    def prefixes(cls, word, internal=False):
        stem = word
        prefix = None
        for potential_prefix in cls.PREFIXES:
            if word.startswith(potential_prefix):
                prefix = potential_prefix
                stem = word[len(prefix):]
                if len(stem) >= cls.MINUMUM_STEM_LENGTH:
                    break
                else:
                    prefix = None
                    stem = word
        if prefix:
            others, stem = cls.prefixes(stem, True)
            others.append(prefix)
            return (others, stem) if internal else (reversed(others), stem)
        else:
            return [], stem

    @classmethod
    def suffixes(cls, word):
        suffix = None
        stem = word
        for potential_suffix in cls.SUFFIXES:
            if word.endswith(potential_suffix):
                suffix = potential_suffix
                stem = word[:-len(suffix)]
                if len(stem) >= cls.MINUMUM_STEM_LENGTH:
                    break
                else:
                    suffix = None
                    stem = word
        if suffix:
            others, stem = cls.suffixes(stem)
            others.append(suffix)
            return others, stem
        else:
            return [], stem

    @classmethod
    def parse(cls, word):
        prefixes, word = cls.prefixes(word)
        suffixes, word = cls.suffixes(word)
        return(tuple(prefixes), word, tuple(suffixes))

words = ['con', 'deAAers', 'deAAAers', 'construct', 'destructer', 'constructs', 'deconstructs', 'deconstructers']

parser = Parser()
for word in words:
    print(parser.parse(word))

Which gives us:

((), 'con', ())
(('de',), 'AAer', ('s',))
(('de',), 'AAA', ('er', 's'))
(('con',), 'struct', ())
(('de',), 'struct', ('er',))
(('con',), 'struct', ('s',))
(('de', 'con'), 'struct', ('s',))
(('de', 'con'), 'struct', ('er', 's'))

This works by taking the word, and using the str.startswith() function to find prefixes. It does this recursively until it is reduced to a word with no prefixes, then passes back the list of prefixes.

It then does a similar thing for suffixes, except using str.endswith() for obvious reasons.

Gareth Latty
  • 86,389
  • 17
  • 178
  • 183
  • This looks great! I just tested it with some real data, and it seems to work. The only problems I see is that, with some short words that also happen to be valid prefixes, it parses it as a prefix without a stem. Is there any way to modify this to force a minimum length for the stem? – larapsodia Apr 14 '12 at 21:02