
I have tried to lemmatize words from the Holy Quran, but some words can't be lemmatized.

Here's my sentence:

sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"

That sentence is part of my txt dataset. As you can see, it contains "surahs", which is the plural form of "surah". I tried this code:

from nltk.stem import WordNetLemmatizer

def lemmatize(self, ayat):
    # ayat is a list of tokens; each one is lemmatized as a verb ('v')
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []

    for word in ayat:
        result.append(wordnet_lemmatizer.lemmatize(word, 'v'))
    return result

When I run it and print the result, I get:

['bring', 'ten', 'surahs', 'like', u'invent', 'call', 'upon', 'assistance', 'whomever', 'besides', 'Allah', 'truthful']

'surahs' isn't changed into 'surah'.

Can anybody tell me why? Thanks.

sang
  • There is nothing wrong with the wordnetlemmatizer per se but it just can't handle irregular words well enough. You could try this 'hack' - https://stackoverflow.com/questions/22333392/stemming-some-plurals-with-wordnet-lemmatizer-doesnt-work – SUBHAM MAJAVADIYA Jun 05 '17 at 05:11
  • I've tried that hack but it returns none [] – sang Jun 05 '17 at 05:21

1 Answer

See

For most non-standard English words, the WordNet lemmatizer is not going to help much in getting the correct lemma. Try a stemmer instead:

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('surahs')
u'surah'
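
To apply the same idea to a whole token list rather than a single word, a minimal sketch could look like the following. The helper name `stem_tokens` and the short token list are assumptions made for illustration (the tokens are copied from the question's output); only `PorterStemmer` and its `stem` method come from NLTK. Keep in mind that a stemmer strips suffixes by rule, so the results are not guaranteed to be dictionary words.

```python
from nltk.stem import PorterStemmer


def stem_tokens(tokens):
    """Return the Porter stem of every token (hypothetical helper, not part of NLTK)."""
    porter = PorterStemmer()
    return [porter.stem(token) for token in tokens]


# A few tokens taken from the output list in the question.
tokens = ['bring', 'ten', 'surahs', 'like', 'invent', 'call', 'upon']
stems = stem_tokens(tokens)
print(stems)  # 'surahs' should now appear as 'surah'
```

Unlike the WordNet lemmatizer, the Porter stemmer does not need a downloaded corpus, which makes it a cheap fallback for words that WordNet does not know.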

Also, try the lemmatize_sent in earthy (an nltk wrapper, "shameless plug"):

>>> from earthy.nltk_wrappers import lemmatize_sent
>>> sentence = "Then bring ten surahs like it that have been invented and call upon for assistance whomever you can besides Allah if you should be truthful"
>>> lemmatize_sent(sentence)
[('Then', 'Then', 'RB'), ('bring', 'bring', 'VBG'), ('ten', 'ten', 'RP'), ('surahs', 'surahs', 'NNS'), ('like', 'like', 'IN'), ('it', 'it', 'PRP'), ('that', 'that', 'WDT'), ('have', 'have', 'VBP'), ('been', u'be', 'VBN'), ('invented', u'invent', 'VBN'), ('and', 'and', 'CC'), ('call', 'call', 'VB'), ('upon', 'upon', 'NN'), ('for', 'for', 'IN'), ('assistance', 'assistance', 'NN'), ('whomever', 'whomever', 'NN'), ('you', 'you', 'PRP'), ('can', 'can', 'MD'), ('besides', 'besides', 'VB'), ('Allah', 'Allah', 'NNP'), ('if', 'if', 'IN'), ('you', 'you', 'PRP'), ('should', 'should', 'MD'), ('be', 'be', 'VB'), ('truthful', 'truthful', 'JJ')]

>>> words, lemmas, tags = zip(*lemmatize_sent(sentence))
>>> lemmas
('Then', 'bring', 'ten', 'surahs', 'like', 'it', 'that', 'have', u'be', u'invent', 'and', 'call', 'upon', 'for', 'assistance', 'whomever', 'you', 'can', 'besides', 'Allah', 'if', 'you', 'should', 'be', 'truthful')

>>> from earthy.nltk_wrappers import pywsd_lemmatize
>>> pywsd_lemmatize('surahs')
'surahs'

>>> from earthy.nltk_wrappers import porter_stem
>>> porter_stem('surahs')
u'surah'
alvas
  • wow, thanks. this is cool. but what is "earthy" module and where can I get that? I can't call "earthy", the module's name is undefined. – sang Jun 05 '17 at 06:44
  • `pip install -U earthy` – alvas Jun 05 '17 at 07:28
  • Wow, cool, thanks. I have installed it. Are there any books or tutorials for the earthy library? – sang Jun 06 '17 at 03:56
  • There's https://github.com/alvations/earthy/blob/master/FAQ.md but if you want a more serious tool, try `spacy` https://spacy.io – alvas Jun 06 '17 at 04:01