I want to get the base (dictionary) form of natural English words, e.g.:

'words' -> 'word'
'Jhon'  -> 'John'
'openning' -> 'open'

I have tried the Python Stemmer library:

import Stemmer
st = Stemmer.Stemmer('english')
for w in ('very', 'words', 'openning'):
    print(st.stemWord(w), end=' ')
# Output: veri word open

I expected 'very' but got 'veri' instead.

Then I tried nltk.corpus.wordnet:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('beans')
[Synset('bean.n.01'),
 Synset('bean.n.02'),
 Synset('bean.n.03'),
 Synset('attic.n.03'),
 Synset('bean.v.01')]

It gives more information, but it is not a quick dictionary lookup.

LancasterStemmer does not leave 'english' as 'english':

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('english')
'engl'

The enchant library's check() and suggest() methods are not suitable either:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")

Is there a quick way to get the original form of every word in a document's text?
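To make the goal concrete, here is a minimal sketch of the kind of lookup I have in mind. The table `BASE_FORMS` and the helpers `base_form` and `normalize` are hypothetical names made up for illustration, not part of any of the libraries above, and the three entries are just my examples from the top of the question. A real solution would need a complete table (or a lemmatizer) instead of a hand-written one.

```python
import re

# Hypothetical hand-made lookup table (an assumption for illustration only);
# a real one would have to come from a dictionary file or a lemmatizer.
BASE_FORMS = {
    "words": "word",
    "jhon": "John",
    "openning": "open",
}


def base_form(word):
    """Return the table entry for word, or the word unchanged if it is unknown."""
    return BASE_FORMS.get(word.lower(), word)


def normalize(text):
    """Split text into alphabetic tokens and map each token to its base form."""
    return [base_form(token) for token in re.findall(r"[A-Za-z]+", text)]


print(normalize("Jhon likes openning very long words"))
# ['John', 'likes', 'open', 'very', 'long', 'word']
```

Unknown words such as 'very' or 'english' pass through unchanged, which is the behaviour the stemmers above do not give me.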

whi
  • A method would be to use machine learning. You provide the input and expected output and hope for the program to learn to recognize the correct words. – drum May 05 '14 at 02:26
  • How about [this](https://pypi.python.org/pypi/stemming/1.0) one? – aruisdante May 05 '14 at 03:04
  • Your term "initial form" is not well-defined, but it appears that you need a *lemmatizer* rather than a stemmer. See e.g. http://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers – tripleee May 05 '14 at 03:17
  • Yes, I tend to combine stemming and lemmatization. Now another problem occurs: US -> UK spellings. I want to find a mapping between them. – whi May 05 '14 at 09:56

0 Answers