First, there is no need to iterate top-down from entity.n.01 to get its hyponyms; you can simply check root_hypernyms() bottom-up over all noun synsets:
>>> from nltk.corpus import wordnet as wn
>>> len(set(wn.all_synsets('n')))
82115
>>> entity = wn.synset('entity.n.01')
>>> len([i for i in wn.all_synsets('n') if entity in i.root_hypernyms()])
82115
Here's how Synset.root_hypernyms() works, from https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L439:
def root_hypernyms(self):
    """Get the topmost hypernyms of this synset in WordNet."""
    result = []
    seen = set()
    todo = [self]
    while todo:
        next_synset = todo.pop()
        if next_synset not in seen:
            seen.add(next_synset)
            next_hypernyms = next_synset.hypernyms() + \
                next_synset.instance_hypernyms()
            if not next_hypernyms:
                result.append(next_synset)
            else:
                todo.extend(next_hypernyms)
    return result
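To see why it matters that this walk follows both hypernyms() and instance_hypernyms(), here is a minimal sketch on a toy graph. The dictionaries HYPERNYMS and INSTANCE_HYPERNYMS and the follow_instances flag are made up for illustration; they are not part of NLTK, and the tiny hierarchy is not real WordNet data.

```python
# Toy stand-in for WordNet (hypothetical data, not the real hierarchy):
# each node maps to its parents through one of two relations.
HYPERNYMS = {
    "entity": [],
    "event": ["entity"],
    "migration": ["event"],
    "hegira": [],  # a named instance: no ordinary hypernym link
}
INSTANCE_HYPERNYMS = {
    "hegira": ["migration"],  # reachable only through the instance relation
}


def root_hypernyms(node, follow_instances=True):
    """Return the topmost ancestors of node using a stack-based walk."""
    result, seen, todo = [], set(), [node]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = list(HYPERNYMS.get(current, []))
        if follow_instances:
            parents += INSTANCE_HYPERNYMS.get(current, [])
        if parents:
            todo.extend(parents)
        else:
            # No parents under the followed relations: treat as a root.
            result.append(current)
    return result


print(root_hypernyms("hegira"))                          # ['entity']
print(root_hypernyms("hegira", follow_instances=False))  # ['hegira']
```

When the instance relation is ignored, the instance node looks like its own root, because the walk never finds a parent for it.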
There is another way to access the hypernyms/hyponyms, Synset.closure(), but used with hyponyms() alone it does not return as many synsets as the root_hypernyms() check; see "How to get all the hyponyms of a word/synset in python nltk and wordnet?":
>>> len(set([s for s in entity.closure(lambda s:s.hyponyms())]))
74373
To iterate through them individually:
>>> for s in entity.closure(lambda s: s.hyponyms()):
...     print(s)
So let's try going bottom-up:
>>> from nltk.corpus import wordnet as wn
>>>
>>> synsets_with_entity_root = 0
>>> entity = wn.synset('entity.n.01')
>>>
>>> for i in wn.all_synsets('n'):
...     # Get the root hypernym the hard way.
...     x = set(i.closure(lambda s: s.hypernyms()))
...     if entity in x:
...         synsets_with_entity_root += 1
...
>>> print(synsets_with_entity_root)
74373
It seems that when walking the hypernym/hyponym tree this way, whether bottom-up or top-down, we miss ~8000 synsets (82115 - 74373 = 7742), so let's check which ones:
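from nltk.corpus import wordnet as wn

synsets_with_entity_root = 0
entity = wn.synset('entity.n.01')
for i in wn.all_synsets('n'):
    # Get the root hypernym the hard way.
    x = set(i.closure(lambda s: s.hypernyms()))
    if entity in x:
        synsets_with_entity_root += 1
    else:
        # Show each synset the closure missed, with its real root.
        print(i, i.root_hypernyms())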
You'll get a list of the missing ~8000 synsets; here are the first few:
Synset('entity.n.01') [Synset('entity.n.01')]
Synset('hegira.n.01') [Synset('entity.n.01')]
Synset('underground_railroad.n.01') [Synset('entity.n.01')]
Synset('babylonian_captivity.n.01') [Synset('entity.n.01')]
Synset('creation.n.05') [Synset('entity.n.01')]
Synset('berlin_airlift.n.01') [Synset('entity.n.01')]
Synset('secession.n.02') [Synset('entity.n.01')]
Synset('human_genome_project.n.01') [Synset('entity.n.01')]
Synset('manhattan_project.n.02') [Synset('entity.n.01')]
Synset('peasant's_revolt.n.01') [Synset('entity.n.01')]
Synset('first_crusade.n.01') [Synset('entity.n.01')]
Synset('second_crusade.n.01') [Synset('entity.n.01')]
Synset('third_crusade.n.01') [Synset('entity.n.01')]
Synset('fourth_crusade.n.01') [Synset('entity.n.01')]
Synset('fifth_crusade.n.01') [Synset('entity.n.01')]
Synset('sixth_crusade.n.01') [Synset('entity.n.01')]
Synset('seventh_crusade.n.01') [Synset('entity.n.01')]
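So closure() over hypernyms()/hyponyms() alone undercounts. Apart from entity.n.01 itself, the missing synsets listed here are named instances, which WordNet links through instance_hypernyms() rather than hypernyms(); root_hypernyms() follows both relations, as its source above shows. closure() is still an elegant method if exact numbers are not your concern.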
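One likely way to close the gap is to pass closure() a relation that combines hyponyms() and instance_hyponyms(), mirroring what root_hypernyms() does in the other direction. The sketch below shows the idea on a toy hierarchy so it runs without the WordNet corpus. ToySynset and the standalone closure() function are hypothetical stand-ins, not NLTK classes; only the method names hyponyms() and instance_hyponyms() match NLTK's Synset API.

```python
from collections import deque


class ToySynset:
    """Hypothetical stand-in exposing two NLTK-style relation methods."""

    def __init__(self, name):
        self.name = name
        self._hyponyms = []
        self._instance_hyponyms = []

    def hyponyms(self):
        return self._hyponyms

    def instance_hyponyms(self):
        return self._instance_hyponyms


def closure(start, rel):
    """Breadth-first transitive closure of rel, excluding start itself."""
    seen, queue = {start}, deque([start])
    while queue:
        for child in rel(queue.popleft()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                yield child


# Toy hierarchy: entity > event > migration, with hegira as an instance.
entity, event, migration, hegira = (
    ToySynset(n) for n in ("entity", "event", "migration", "hegira")
)
entity._hyponyms.append(event)
event._hyponyms.append(migration)
migration._instance_hyponyms.append(hegira)

only_hyponyms = [s.name for s in closure(entity, lambda s: s.hyponyms())]
both = [
    s.name
    for s in closure(entity, lambda s: s.hyponyms() + s.instance_hyponyms())
]
print(only_hyponyms)  # ['event', 'migration']
print(both)           # ['event', 'migration', 'hegira']
```

With NLTK itself, the analogous call would be entity.closure(lambda s: s.hyponyms() + s.instance_hyponyms()); the resulting count has not been re-run here, so treat it as something to verify against the all_synsets('n') total.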