Compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet

Question

I am trying to compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet. This is the function I have defined:

def averagePolysemy(synsets):
    allSynsets = list(wn.all_synsets(synsets))
    lemmas = [synset.lemma_names() for synset in allSynsets]
    senseCount = 0
    for lemma in lemmas:
        senseCount = senseCount + len(wn.synsets(lemma, synsets))
    return senseCount/len(allSynsets)

averagePolysemy(wn.NOUN)

When I call it I get the error:

Traceback (most recent call last):

File "<ipython-input-214-345e72500ae3>", line 1, in <module>
averagePolysemy(wn.NOUN)

File "<ipython-input-213-616cc4af89d1>", line 6, in averagePolysemy
senseCount = senseCount + len(wn.synsets(lemma, synsets))

File "/Users/anna/anaconda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1483, in synsets
lemma = lemma.lower()

AttributeError: 'list' object has no attribute 'lower'

I'm not sure which part of my function is causing this error.

Looks like maybe `synset.lemma_names` should be `sysnet.lemma_names()`? — BrenBarn, Oct 03 '17 at 04:53
You need to think about what `lemma_names` returns. It looks like it returns a list. Does `synsets` expect a list? It looks like not. — BrenBarn, Oct 03 '17 at 05:01
You might be also interested in computing the "sense entropy" of the word by class, take a look at Eqn 1 from https://www.aclweb.org/anthology/S/S16/S16-1147.pdf (disclaimer: co-author of the paper) — alvas, Oct 05 '17 at 01:32
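To illustrate the point in the comments above: `lemma_names()` returns a list of strings for each synset, so the list of those results is a list of lists, and each inner list has to be flattened before its items can be passed to a lookup that expects a single string. The following is a minimal sketch with made-up stand-in data (the `lemma_lists` values are hypothetical, not read from WordNet):

```python
# Hypothetical stand-in data: each inner list mimics what
# synset.lemma_names() returns for one synset.
lemma_lists = [['run', 'tally'], ['test', 'trial', 'run'], ['run']]

# Flatten to individual strings and de-duplicate, so that every item
# is a plain string that a per-word lookup could accept.
lemma_names = {name for names in lemma_lists for name in names}

print(sorted(lemma_names))
```

In the function from the question, the same flattening would be applied to the lemma names before calling `wn.synsets(lemma, synsets)` on each one.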

score 2 · Answer 1 · answered Oct 05 '17 at 02:38

First, let's define what polysemy is.

Polysemy: The coexistence of many possible meanings for a word or phrase.

(Source: https://www.google.com/search?q=polysemy)

From WordNet:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

And in WordNet there are several terms that we should be familiar with:

Synset: a distinct concept/meaning

Lemma: a root form of a word

Part-Of-Speech (POS): the linguistic category of a word

Word: a surface form of a word (surface words are not in WordNet)

(Note: @alexis has a good answer on lemma vs synset: https://stackoverflow.com/a/42050466/610569; See also https://stackoverflow.com/a/23715743/610569 and https://stackoverflow.com/a/29478711/610569)

In code:

from nltk.corpus import wordnet as wn
# Given a word "run"
word = 'run'
# We get multiple meanings (i.e. synsets) for
# the word in WordNet.
for synset in wn.synsets(word):
    # Each synset comes with an ID.
    offset = str(synset.offset()).zfill(8)
    # Each meaning comes with its
    # linguistic category (i.e. POS).
    pos = synset.pos()
    # Usually, offset + POS is the way
    # to index a synset.
    idx = offset + '-' + pos
    # Each meaning also comes with its
    # own distinct definition.
    definition = synset.definition()
    # For each meaning, there are multiple
    # root words (i.e. lemmas).
    lemmas = ','.join(synset.lemma_names())
    print('\t'.join([idx, definition, lemmas]))

[out]:

00189565-n    a score in baseball made by a runner touching all four bases safely    run,tally
00791078-n    the act of testing something    test,trial,run
07460104-n    a race run on foot    footrace,foot_race,run
00309011-n    a short trip    run
01926311-v    move fast by using one's feet, with one foot off the ground at any given time    run
02075049-v    flee; take to one's heels; cut and run    scat,run,scarper,turn_tail,lam,run_away,hightail_it,bunk,head_for_the_hills,take_to_the_woods,escape,fly_the_coop,break_away

Going back to the question, how to "compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet"?

Since we're working with WordNet, surface words are out of the way and we're only left with lemmas.

First, we need to collect the lemmas that belong to each POS (nouns, verbs, adjectives and adverbs).

from nltk.corpus import wordnet as wn
from collections import defaultdict

words_by_pos = defaultdict(set)

for synset in wn.all_synsets():
    pos = synset.pos()
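    for lemma in synset.lemmas():
        # Store the lemma's name (a string), since wn.synsets()
        # expects a string rather than a Lemma object.
        words_by_pos[pos].add(lemma.name())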

But this is a simplistic view of the relation between lemmas and POS:

# There are 5 POS.
>>> words_by_pos.keys() 
dict_keys(['a', 's', 'r', 'n', 'v'])

# Some words have multiple POS tags =(
>>> len(words_by_pos['n'])
119034
>>> len(words_by_pos['v'])
11531
>>> len(words_by_pos['n'].intersection(words_by_pos['v']))
4062
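Two things are visible in that session: WordNet reports five POS tags, where 's' marks satellite adjectives, and the per-POS sets overlap. A minimal sketch of how one might fold 's' into 'a' to get the four categories from the question, and how to inspect the overlap, is given here. It uses a small made-up `words_by_pos` (hypothetical entries, not real WordNet contents) so that it runs on its own:

```python
from collections import defaultdict

# Toy stand-in for words_by_pos (hypothetical entries for illustration).
words_by_pos = defaultdict(set, {
    'a': {'fast', 'able'},
    's': {'fast', 'quick'},
    'n': {'run', 'test'},
    'v': {'run', 'test', 'scat'},
    'r': {'fast', 'quickly'},
})

# Treat satellite adjectives ('s') as ordinary adjectives ('a').
adjectives = words_by_pos['a'] | words_by_pos['s']

# Words that are listed both as nouns and as verbs.
noun_verb_overlap = words_by_pos['n'] & words_by_pos['v']

print(sorted(adjectives))
print(sorted(noun_verb_overlap))
```

Whether merging 'a' and 's' is appropriate depends on how "adjective" is meant to be counted; it is an assumption of this sketch, not something the question specifies.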

Let's see if we can ignore that and move on:

# Let's look at the verb 'v' category.
num_meanings_per_verb = []

for word in words_by_pos['v']:
    # Number of meanings for a word given a POS.
    num_meaning = len(wn.synsets(word, pos='v'))
    num_meanings_per_verb.append(num_meaning)
print(sum(num_meanings_per_verb) / len(num_meanings_per_verb))

[out]:

2.1866273523545225

What does the number mean? (if it means anything at all)

It means that, across all verbs in WordNet, a verb has about 2.19 meanings on average, ignoring the fact that some words have more meanings in other POS categories.
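The verb calculation can be wrapped in a small helper so that it can be repeated for the other POS categories. This is only a sketch: `average_polysemy` and its `count_senses` parameter are names introduced here for illustration, and the toy dictionary stands in for a real WordNet lookup so that the snippet runs without any corpus data.

```python
def average_polysemy(words, count_senses):
    """Return the mean number of senses per word.

    `count_senses` is any callable mapping a word to its sense count.
    """
    counts = [count_senses(word) for word in words]
    return sum(counts) / len(counts) if counts else 0.0

# Toy stand-in for a WordNet lookup (hypothetical counts, not real data).
toy_senses = {'run': 3, 'tally': 1, 'scat': 1, 'test': 2}
toy_average = average_polysemy(toy_senses, toy_senses.get)

print(toy_average)
```

With NLTK, `words` could be `words_by_pos['v']` and `count_senses` could be `lambda w: len(wn.synsets(w, pos='v'))`, mirroring the loop above; the same call with 'n', 'a' or 'r' would cover the other categories.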

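Perhaps there is some meaning to that average, but look at the counts of the values in num_meanings_per_verb. (Note: the counts shown below sum to 119034, the size of the noun set above, so they appear to come from the same loop run over words_by_pos['n'].)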

Counter({1: 101168,
         2: 11136,
         3: 3384,
         4: 1398,
         5: 747,
         6: 393,
         7: 265,
         8: 139,
         9: 122,
         10: 85,
         11: 74,
         12: 39,
         13: 29,
         14: 10,
         15: 19,
         16: 10,
         17: 6,
         18: 2,
         20: 5,
         26: 1,
         30: 1,
         33: 1})
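A tally like the one above can be produced with `collections.Counter`. The sketch below uses a short made-up list of sense counts (hypothetical values, not taken from WordNet) and prints the mean next to the median, since a long tail of highly polysemous words pulls the mean upward while most words have a single sense.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sense counts per word (made up for illustration).
sense_counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 12]

# How many words have 1 sense, 2 senses, and so on.
distribution = Counter(sense_counts)
print(sorted(distribution.items()))

# The mean is sensitive to the long tail; the median is not.
print(mean(sense_counts), median(sense_counts))
```

On real data, `sense_counts` would be a list such as `num_meanings_per_verb` from the loop earlier in this answer.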
