First, let's define polysemy.
Polysemy: The coexistence of many possible meanings for a word or phrase.
(Source: https://www.google.com/search?q=polysemy)
From WordNet:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
And in WordNet there are several terms that we should be familiar with:
Synset: a distinct concept/meaning
Lemma: a root form of a word
Part-Of-Speech (POS): the linguistic category of a word
Word: a surface form of a word (surface words are not in WordNet)
(Note: @alexis has a good answer on lemma vs synset: https://stackoverflow.com/a/42050466/610569; see also https://stackoverflow.com/a/23715743/610569 and https://stackoverflow.com/a/29478711/610569)
In code:
from nltk.corpus import wordnet as wn
# Given a word "run"
word = 'run'
# We get multiple meanings (i.e. synsets) for
# the word in WordNet.
for synset in wn.synsets(word):
    # Each synset comes with an ID.
    offset = str(synset.offset()).zfill(8)
    # Each meaning comes with its
    # linguistic category (i.e. POS).
    pos = synset.pos()
    # Usually, offset + POS is the way
    # to index a synset.
    idx = offset + '-' + pos
    # Each meaning also comes with its
    # own distinct definition.
    definition = synset.definition()
    # Each meaning can be expressed by multiple
    # root words (i.e. lemmas).
    lemmas = ','.join(synset.lemma_names())
    print('\t'.join([idx, definition, lemmas]))
[out]:
00189565-n a score in baseball made by a runner touching all four bases safely run,tally
00791078-n the act of testing something test,trial,run
07460104-n a race run on foot footrace,foot_race,run
00309011-n a short trip run
01926311-v move fast by using one's feet, with one foot off the ground at any given time run
02075049-v flee; take to one's heels; cut and run scat,run,scarper,turn_tail,lam,run_away,hightail_it,bunk,head_for_the_hills,take_to_the_woods,escape,fly_the_coop,break_away
Going back to the question, how to "compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet"?
Since we're working with WordNet, surface words are out of the picture and we're left with only lemmas.
First, we need to collect the lemmas that belong to each POS (nouns, verbs, adjectives and adverbs).
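from nltk.corpus import wordnet as wn
from collections import defaultdict
# Map each POS tag to the set of lemma names found under it.
words_by_pos = defaultdict(set)
for synset in wn.all_synsets():
    pos = synset.pos()
    # Store lemma names (strings) rather than Lemma objects:
    # wn.synsets() takes a string, and the set sizes and
    # intersections only make sense on strings.
    for lemma in synset.lemma_names():
        words_by_pos[pos].add(lemma)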
But this is a simplistic view of the relation between lemmas and POS:
# There are 5 POS tags ('s' marks satellite adjectives).
>>> words_by_pos.keys()
dict_keys(['a', 's', 'r', 'n', 'v'])
# Some words have multiple POS tags =(
>>> len(words_by_pos['n'])
119034
>>> len(words_by_pos['v'])
11531
>>> len(words_by_pos['n'].intersection(words_by_pos['v']))
4062
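To gauge the overlap for every pair of POS tags rather than just nouns and verbs, a minimal sketch follows. It assumes a mapping shaped like words_by_pos (POS tag to a set of lemma strings); the helper name pos_overlaps and the toy data are illustrative assumptions, chosen so the snippet runs without the WordNet corpus.

```python
from itertools import combinations


def pos_overlaps(words_by_pos):
    """Count the lemma strings shared by each pair of POS tags.

    `words_by_pos` maps a POS tag to a set of lemma strings.
    (Hypothetical helper, not part of NLTK.)
    """
    overlaps = {}
    # Sort the tags so each pair is reported in a stable order.
    for pos_a, pos_b in combinations(sorted(words_by_pos), 2):
        shared = words_by_pos[pos_a] & words_by_pos[pos_b]
        overlaps[(pos_a, pos_b)] = len(shared)
    return overlaps


# Toy stand-in for the real words_by_pos built from WordNet.
toy_words_by_pos = {
    'n': {'run', 'dog', 'light'},
    'v': {'run', 'dog', 'eat'},
    'a': {'light'},
}
toy_overlaps = pos_overlaps(toy_words_by_pos)
for pair, count in toy_overlaps.items():
    print(pair, count)
```

Passing the real words_by_pos instead of the toy dictionary should give, under the ('n', 'v') key, the same noun/verb overlap as the intersection computed above.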
Let's see if we can ignore that and move on:
# Let's look at the verb 'v' category
num_meanings_per_verb = []
for word in words_by_pos['v']:
    # No. of meanings for a word given a POS.
    num_meaning = len(wn.synsets(word, pos='v'))
    num_meanings_per_verb.append(num_meaning)
print(sum(num_meanings_per_verb) / len(num_meanings_per_verb))
[out]:
2.1866273523545225
What does the number mean (if it means anything at all)?
It means that
- across all verb lemmas in WordNet,
- each has about 2.19 meanings on average,
- ignoring the fact that some of those words have more meanings under other POS categories.
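The same average can be computed for any POS by factoring the loop into a function. The sketch below is one way to do that, not the only one. The names average_polysemy and count_senses are introduced here for illustration, and the toy sense inventory stands in for wn.synsets so the snippet runs without the WordNet data.

```python
def average_polysemy(words, count_senses):
    """Mean number of senses per word.

    `words` is an iterable of lemma strings and `count_senses` is a
    callable returning the number of senses for one word.
    (Both names are illustrative, not part of NLTK.)
    """
    counts = [count_senses(word) for word in words]
    if not counts:
        raise ValueError("no words given")
    return sum(counts) / len(counts)


# Toy sense inventory standing in for WordNet: word -> number of verb senses.
toy_verb_senses = {'run': 4, 'eat': 2, 'scarper': 1, 'hightail_it': 1}

# Iterating over the dict yields the words; dict.get looks up each count.
toy_average = average_polysemy(toy_verb_senses, toy_verb_senses.get)
print(toy_average)  # (4 + 2 + 1 + 1) / 4 = 2.0
```

With NLTK loaded, the verb computation would correspond to average_polysemy(words_by_pos['v'], lambda w: len(wn.synsets(w, pos='v'))), and passing another POS tag in both places covers the other categories.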
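Perhaps there is some meaning to it, but look at how the per-word counts are distributed. (Note that the Counter below sums to 119,034, the size of words_by_pos['n'], so it appears to come from the same loop run with pos='n' rather than from num_meanings_per_verb.)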
Counter({1: 101168,
2: 11136,
3: 3384,
4: 1398,
5: 747,
6: 393,
7: 265,
8: 139,
9: 122,
10: 85,
11: 74,
12: 39,
13: 29,
14: 10,
15: 19,
16: 10,
17: 6,
18: 2,
20: 5,
26: 1,
30: 1,
33: 1})
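To put numbers on how skewed this distribution is, the sketch below recomputes a few summary figures directly from the counts displayed above. The variable names are introduced here for illustration. Only the printed Counter is used, so no WordNet data is needed.

```python
from collections import Counter

# Distribution copied from the output above:
# number of meanings -> number of words with that many meanings.
meaning_counts = Counter({
    1: 101168, 2: 11136, 3: 3384, 4: 1398, 5: 747, 6: 393, 7: 265,
    8: 139, 9: 122, 10: 85, 11: 74, 12: 39, 13: 29, 14: 10, 15: 19,
    16: 10, 17: 6, 18: 2, 20: 5, 26: 1, 30: 1, 33: 1,
})

total_words = sum(meaning_counts.values())
total_senses = sum(n * freq for n, freq in meaning_counts.items())

# Share of words with exactly one meaning (monosemous words).
monosemous_share = meaning_counts[1] / total_words

# Average number of meanings, with and without monosemous words.
mean_all = total_senses / total_words
mean_polysemous = (total_senses - meaning_counts[1]) / (total_words - meaning_counts[1])

print(f"words: {total_words}")
print(f"monosemous share: {monosemous_share:.1%}")
print(f"mean (all words): {mean_all:.2f}")
print(f"mean (polysemous only): {mean_polysemous:.2f}")
```

Roughly 85% of the words in this distribution have a single meaning, so the plain average is dominated by monosemous entries. Whether to include them is a choice that should be stated alongside any "average polysemy" figure.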