I'm trying to create a simple parser for some text data. (The text is in a language that NLTK doesn't have any parsers for.)
Basically, I have a limited number of prefixes, which can be either one or two letters; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever's in between them should be the "root" of the word. Many words will have more than one possible parsing, so I want to input a word and get back a list of possible parses, each in the form of a tuple (prefix, root, suffix).
I can't figure out how to structure the code though. I pasted an example of one way I tried (using some dummy English data to make it more understandable), but it's clearly not right. For one thing it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work with words that have more than one prefix or suffix, or both prefix(es) and suffix(es).
Any thoughts?
prefixes = ['de','con']
suffixes = ['er','s']
def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses
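One thing worth checking in the function above is the slicing used for the suffix tests. A slice with a negative stop index, such as word[-2:-1], leaves out the last character, so it can never match a suffix at the very end of the word. The snippet below is only an illustration on a dummy word (the variable names are made up for the example); it also shows str.startswith and str.endswith, which avoid the index arithmetic altogether.

```python
word = 'constructs'

# A negative stop index excludes the final character,
# so these slices never reach the end of the word.
one_char = word[-2:-1]    # 't'
two_chars = word[-3:-1]   # 'ct'

# Omitting the stop index keeps everything up to the end.
last_one = word[-1:]      # 's'
last_two = word[-2:]      # 'ts'

# startswith / endswith test an affix of any length directly.
has_suffix_s = word.endswith('s')        # True
has_prefix_con = word.startswith('con')  # True
```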
>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...     parses = parser(w)
...     print w
...     for p in parses:
...         print p
...
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')
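For comparison, here is a rough sketch of one possible structure for what I'm after: peel prefixes off the front recursively, peel suffixes off the back recursively, and combine every pairing. The function names (strip_prefixes, strip_suffixes, parse) and the min_root_len parameter are invented for this sketch. It assumes affixes are non-empty strings, that they may stack in any order, and that a root must have at least three letters (mirroring the len(...) > 2 checks above). Because words can carry several affixes, each parse holds a tuple of prefixes and a tuple of suffixes rather than single strings.

```python
def strip_prefixes(word, prefixes):
    """Yield (stripped_prefixes, remainder) for every way of peeling
    zero or more known prefixes off the front of word."""
    yield (), word
    for p in prefixes:
        # Skip empty strings so the recursion always shrinks the word.
        if p and word.startswith(p):
            for more, remainder in strip_prefixes(word[len(p):], prefixes):
                yield (p,) + more, remainder


def strip_suffixes(word, suffixes):
    """Yield (stripped_suffixes, remainder) for every way of peeling
    zero or more known suffixes off the end of word."""
    yield (), word
    for s in suffixes:
        if s and word.endswith(s):
            for more, remainder in strip_suffixes(word[:-len(s)], suffixes):
                # Inner suffixes come first so the tuple reads left to right.
                yield more + (s,), remainder


def parse(word, prefixes, suffixes, min_root_len=3):
    """Return every (prefixes, root, suffixes) split of word whose
    root has at least min_root_len characters."""
    parses = []
    for pre, rest in strip_prefixes(word, prefixes):
        for suf, root in strip_suffixes(rest, suffixes):
            if len(root) >= min_root_len:
                parses.append((pre, root, suf))
    return parses


prefixes = ['de', 'con']
suffixes = ['er', 's']
result = parse('deconstructs', prefixes, suffixes)
for p in result:
    print(p)
```

With the dummy data this should list six candidates for 'deconstructs', from the unparsed word ((), 'deconstructs', ()) down to (('de', 'con'), 'struct', ('s',)). The parse with no affixes at all is included on purpose; it can be filtered out if it isn't wanted.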