I am working on an IR project and need an alternative to both stemming (which returns non-words) and lemmatization (which may not change the word at all).

So I looked for a way to get the other forms of a word.

This Python script gives me the derivationally_related_forms of a word (e.g. "retrieving"), using NLTK and WordNet:

from nltk.corpus import wordnet as wn

word = "retrieving"  # renamed from `str` to avoid shadowing the built-in

# Collect the derivationally related forms of every lemma
# in every synset that WordNet returns for the word.
related_forms = set()
for synset in wn.synsets(word):
    for lemma in synset.lemmas():
        for form in lemma.derivationally_related_forms():
            related_forms.add(form.name())

print(list(related_forms))

The output is:

['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']

But what I really want is only 'retrieval' and 'retriever', not 'think', 'recovery', etc.

The result is also missing other forms, such as 'retrieve'.

I know the problem is that the synsets include lemmas other than my input word, so I get unrelated derived forms.

Is there a way to get the result I am expecting?

Nina

1 Answer

You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.
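A minimal sketch of that filtering step follows. The helper `filter_by_stem` and the tiny suffix-stripping `toy_stem` are hypothetical names introduced only so the example runs without extra dependencies; in practice you would presumably pass a real stemmer instead.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one common suffix."""
    for suffix in ("ing", "al", "er", "e", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_by_stem(query, candidates, stem):
    """Keep only the candidates whose stem equals the stem of the query."""
    target = stem(query)
    return sorted(c for c in candidates if stem(c) == target)


# The word list printed by the WordNet script in the question.
candidates = [
    "recollection", "recovery", "regaining", "think", "retrieval",
    "remembering", "recall", "recollective", "thought", "remembrance",
    "recoverer", "retriever",
]

print(filter_by_stem("retrieving", candidates, toy_stem))
```

With NLTK installed, `nltk.stem.PorterStemmer().stem` can be passed as the `stem` argument in place of `toy_stem`. As noted in the comments, this only removes unrelated words; it cannot add forms that WordNet did not return.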

Another approach, not using WordNet, is to get a large dictionary that contains all derived forms, then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (which links back to the question "How to get english language word database?").

The simplest algorithm would be an O(N) linear search over the N dictionary entries, computing the Levenshtein distance from your word to each one. Or run your stemmer on each entry and compare stems.
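The linear search could look roughly like the sketch below. The function names, the `max_distance` threshold, and the short in-memory word list are assumptions made for illustration; a real run would load the entries from a dictionary file such as the one linked above.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_words(query, words, max_distance=2):
    """Scan every word once and keep those within max_distance edits."""
    hits = []
    for word in words:
        distance = levenshtein(query, word)
        if distance <= max_distance:
            hits.append((word, distance))
    # Closest first, ties broken alphabetically.
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


# Hypothetical stand-in for a large dictionary file.
dictionary = ["retrieval", "retriever", "retrieve", "retrieved", "recovery", "think"]

for word, distance in closest_words("retrieve", dictionary):
    print(word, distance)
```

The threshold is a trade-off: a small `max_distance` misses longer derived forms, while a large one lets in unrelated words that merely look similar.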

If efficiency starts to be a concern... well, that is really a new question, but the first idea that comes to mind is you could do a one-off indexing of all entries by the stemmer result.
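One way to picture that one-off index is a Python dictionary keyed by stem, sketched below. `build_stem_index`, `lookup`, and the `toy_stem` function are hypothetical names for illustration, and the word list is a small stand-in for a full dictionary.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one common suffix."""
    for suffix in ("ing", "al", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(words, stem):
    """One pass over the dictionary: map each stem to the words that share it."""
    index = defaultdict(list)
    for word in words:
        index[stem(word)].append(word)
    return index


def lookup(index, query, stem):
    """Stem the query and return every indexed word with the same stem."""
    return index.get(stem(query), [])


words = ["retrieve", "retrieval", "retriever", "retrieving",
         "recovery", "think", "thinking"]
index = build_stem_index(words, toy_stem)

print(lookup(index, "retrieving", toy_stem))
print(lookup(index, "thinks", toy_stem))
```

Building the index costs one pass over the dictionary; after that, each query is a single stemming call plus a dictionary lookup instead of a scan over every entry.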

Darren Cook
  • What do you mean by "one-off indexing of all entries by the stemmer result"? – Nina Apr 05 '20 at 14:08
  • @Nina Make an index (a dictionary in Python terms), where the lookup key is the result of stemming each word, and the value is a list of all the words that stem to that key. Then to find all similar words you just need to stem your search term. (The stemming function will be acting like a hash function.) – Darren Cook Apr 05 '20 at 18:21
  • Okay, got it. And the first suggestion is nice, too – Nina Apr 05 '20 at 18:46
  • The first solution still cannot provide the missing forms, but at least it can eliminate incorrect results. – Nina Apr 05 '20 at 18:47
  • When I looked up "drive" in the WordNet online search: http://wordnetweb.princeton.edu/perl/webwn?o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&s=drive&i=9&h=100100100000000000000000000000000000000000#c I noticed that in the derivationally related form section there is a note beside each form: [Related to:...]. Is this note or relation accessible in NLTK? If so, it could give a good clue – Nina Apr 06 '20 at 19:02
  • @Nina The link in "Related to" seems to be identical to the other link on that same line, so I don't think that is anything extra. There is `pertainyms()` if you haven't tried that. See https://stackoverflow.com/q/28475620/841830 and https://stackoverflow.com/q/14489309/841830 for other ideas. – Darren Cook Apr 07 '20 at 07:31