
For various reasons, I need to iterate over all noun synsets in WordNet 3.0 and build a tree structure from them in my program.

But when I try this with the code below:

from nltk.corpus import wordnet as wn
stack = []
duplicate_check = []
def iterate_all():
    while(stack):
        current_node = stack.pop()
        print current_node,"on top"
        for hypo in current_node.hyponyms():
            stack.append(hypo)
            duplicate_check.append(hypo)
if __name__ == "__main__":
    root = wn.synset("entity.n.01")
    stack.append(root)
    duplicate_check.append(root)
    iterate_all()
    correct_list = list(wn.all_synsets('n'))
#    print list( set(correct_list) - set(duplicate_check) )
    print len(correct_list)
    print len(duplicate_check)

I got 96,308 records in duplicate_check but 82,115 in correct_list. The latter, correct_list, contains the right number of synsets, but duplicate_check does not.

After converting both lists into sets and comparing their elements, I found that the code above loses the "instance of" relation among the noun relationships. So could anyone tell me:

(1) Is the "hyponyms" relation the same as "instance of" in WordNet 3.0, or not?

(2) Is there anything wrong in my code that prevents synsets linked by the "instance of" relation from being added to duplicate_check?

I really appreciate your time.

Environment: Ubuntu 14.04 + Python 2.7 + NLTK latest version + WordNet 3.0

2 Answers


First, there is no need to iterate top-down from entity.n.01 to get its hyponyms; you can simply check root_hypernyms() bottom-up for every synset:

>>> from nltk.corpus import wordnet as wn
>>> len(set(wn.all_synsets('n')))
82115
>>> entity = wn.synset('entity.n.01')
>>> len([i for i in wn.all_synsets('n') if entity in i.root_hypernyms()])
82115

Here's the code showing how Synset.root_hypernyms() works, from https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L439:

def root_hypernyms(self):
    """Get the topmost hypernyms of this synset in WordNet."""

    result = []
    seen = set()
    todo = [self]
    while todo:
        next_synset = todo.pop()
        if next_synset not in seen:
            seen.add(next_synset)
            next_hypernyms = next_synset.hypernyms() + \
                next_synset.instance_hypernyms()
            if not next_hypernyms:
                result.append(next_synset)
            else:
                todo.extend(next_hypernyms)
    return result
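
Note that root_hypernyms() follows both hypernyms() and instance_hypernyms() and keeps a seen set. A top-down walk would presumably need the mirror image of that: both hyponyms() and instance_hyponyms(), plus a seen set so that synsets with several parents are not counted twice. Here is a minimal sketch of that idea. FakeSynset and the toy graph are hypothetical stand-ins so the snippet runs without WordNet data; only the two method names are taken from NLTK.

```python
class FakeSynset(object):
    """Hypothetical stand-in for an NLTK Synset, for illustration only."""

    def __init__(self, label):
        self.label = label
        self._hyponyms = []
        self._instance_hyponyms = []

    def hyponyms(self):
        return list(self._hyponyms)

    def instance_hyponyms(self):
        return list(self._instance_hyponyms)


def all_descendants(root):
    """Return the set of synsets reachable from root, root included."""
    seen = set([root])
    todo = [root]
    while todo:
        current = todo.pop()
        # Follow both kinds of downward links, as root_hypernyms() does upward.
        for child in current.hyponyms() + current.instance_hyponyms():
            if child not in seen:  # skip synsets already reached via another parent
                seen.add(child)
                todo.append(child)
    return seen


# Toy graph (made up): 'crusade' has two parents, 'first_crusade' is an instance.
entity, event, war, crusade, first = [
    FakeSynset(n) for n in ("entity", "event", "war", "crusade", "first_crusade")
]
entity._hyponyms = [event, war]
event._hyponyms = [crusade]
war._hyponyms = [crusade]
crusade._instance_hyponyms = [first]

labels = sorted(s.label for s in all_descendants(entity))
print(labels)
```

Since the function only relies on hyponyms() and instance_hyponyms(), it should also accept wn.synset('entity.n.01') directly, although that is not run here.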

There is another way to access the hypernyms/hyponyms, but it does not seem to be as complete as root_hypernyms(); see How to get all the hyponyms of a word/synset in python nltk and wordnet?:

>>> len(set([s for s in entity.closure(lambda s:s.hyponyms())]))
74373

To iterate through individually:

>>> for s in entity.closure(lambda s:s.hyponyms()):
...     print s

So let's try going bottom-up:

>>> from nltk.corpus import wordnet as wn
>>> 
>>> synsets_with_entity_root = 0
>>> entity = wn.synset('entity.n.01')
>>> 
>>> for i in wn.all_synsets('n'):
...     # Get root hypernym the hard way.
...     x = set([s for s in i.closure(lambda s:s.hypernyms())])
...     if entity in x:
...         synsets_with_entity_root += 1
... 

>>> print synsets_with_entity_root
74373

It seems that when traversing the hypernym/hyponym tree this way, whether bottom-up or top-down, we're missing ~8000 synsets, so we check:

entity = wn.synset('entity.n.01')

for i in wn.all_synsets('n'):
    # Get root hypernym the hard way.
    x = set([s for s in i.closure(lambda s:s.hypernyms())])
    if entity in x:
        synsets_with_entity_root +=1
    else:
        print i, i.root_hypernyms()

And you'll get a list of the missing ~8000 synsets; here are the first few you'll see:

Synset('entity.n.01') [Synset('entity.n.01')]
Synset('hegira.n.01') [Synset('entity.n.01')]
Synset('underground_railroad.n.01') [Synset('entity.n.01')]
Synset('babylonian_captivity.n.01') [Synset('entity.n.01')]
Synset('creation.n.05') [Synset('entity.n.01')]
Synset('berlin_airlift.n.01') [Synset('entity.n.01')]
Synset('secession.n.02') [Synset('entity.n.01')]
Synset('human_genome_project.n.01') [Synset('entity.n.01')]
Synset('manhattan_project.n.02') [Synset('entity.n.01')]
Synset('peasant's_revolt.n.01') [Synset('entity.n.01')]
Synset('first_crusade.n.01') [Synset('entity.n.01')]
Synset('second_crusade.n.01') [Synset('entity.n.01')]
Synset('third_crusade.n.01') [Synset('entity.n.01')]
Synset('fourth_crusade.n.01') [Synset('entity.n.01')]
Synset('fifth_crusade.n.01') [Synset('entity.n.01')]
Synset('sixth_crusade.n.01') [Synset('entity.n.01')]
Synset('seventh_crusade.n.01') [Synset('entity.n.01')]

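So closure() over hypernyms() or hyponyms() alone is lossy. The synsets it misses appear to be instances such as the crusades, which root_hypernyms() reaches because it also follows instance_hypernyms(). It's still an elegant method if exact numbers are not your concern.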

alvas
  • Thank you, sir. I really appreciate it. However, the exact number of synsets is very important to me, because I need to reduce information loss as much as I can. Your answer gave me a great idea: using closure() to reduce the loss. I'm very sorry for the late reply, and I don't mean to be offensive in any way if this makes you uncomfortable. Thank you :) I will try it, and if I solve the problem, which seems to be impossible, I will add another comment. Thanks a lot ^_^. – 许94.94 XU Jan 29 '15 at 17:35

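This code walks up the hypernym chain of each sense of a word and stops cleanly when a sense has no hypernyms, as happens for "instance of" senses: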

from nltk.corpus import wordnet

word = 'bath'  # the word whose senses are traced in the result below
for synset in wordnet.synsets(word):
    syn = synset.name()
    print("---------", syn, "--------", synset.definition())
    # Follow the first hypernym upward until the noun root is reached.
    while syn != 'entity.n.01':
        hyper = wordnet.synset(syn).hypernyms()
        if len(hyper) > 0:
            name = hyper[0].name().split('.')[0]
            print(name)
        else:
            # No plain hypernyms left: stop instead of looping forever.
            print("**** INSTANCE_OF sense, without any hypernyms ****")
            break
        syn = hyper[0].name()

And here is the result:

--------- bath.n.01 -------- a vessel containing liquid in which something is immersed (as to process it or to maintain it at a constant temperature or to lubricate it)
vessel
container
instrumentality
artifact
whole
object
physical_entity
entity
--------- bath.n.02 -------- you soak and wash your body in a bathtub
washup
cleaning
improvement
change_of_state
change
action
act
event
psychological_feature
abstraction
entity
--------- bathtub.n.01 -------- a relatively large open container that you fill with water and use to wash the body
vessel
container
instrumentality
artifact
whole
object
physical_entity
entity
--------- bath.n.04 -------- an ancient Hebrew liquid measure equal to about 10 gallons
liquid_unit
volume_unit
unit_of_measurement
definite_quantity
measure
abstraction
entity
--------- bath.n.05 -------- a town in southwestern England on the River Avon; famous for its hot springs and Roman remains
**** INSTANCE_OF sense, without any hypernyms ****
--------- bathroom.n.01 -------- a room (as in a residence) containing a bathtub or shower and usually a washbasin and toilet
room
area
structure
artifact
whole
object
physical_entity
entity
--------- bathe.v.03 -------- clean one's body by immersion into water
cleanse
groom
fancify
better
change
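
The loop above stops at "instance of" senses such as bath.n.05. One possible extension, sketched below, falls back to instance_hypernyms() whenever hypernyms() is empty, so the walk can continue toward the root. FakeSynset, path_to_root and the toy chain are hypothetical and only illustrate the idea; the two method names are the ones NLTK's Synset provides.

```python
class FakeSynset(object):
    """Hypothetical stand-in for an NLTK Synset, for illustration only."""

    def __init__(self, label, hypernyms=(), instance_hypernyms=()):
        self.label = label
        self._hypernyms = list(hypernyms)
        self._instance_hypernyms = list(instance_hypernyms)

    def hypernyms(self):
        return list(self._hypernyms)

    def instance_hypernyms(self):
        return list(self._instance_hypernyms)


def path_to_root(synset):
    """Follow the first parent link upward until no parent is left."""
    path = [synset]
    seen = set(path)
    while True:
        # Prefer ordinary hypernyms; use instance hypernyms only as a fallback.
        parents = synset.hypernyms() or synset.instance_hypernyms()
        if not parents or parents[0] in seen:  # root reached, or a cycle guard
            break
        synset = parents[0]
        seen.add(synset)
        path.append(synset)
    return path


# Toy chain (made up): a town that is an *instance* of 'town'.
entity = FakeSynset("entity")
location = FakeSynset("location", hypernyms=[entity])
town = FakeSynset("town", hypernyms=[location])
bath_town = FakeSynset("bath_town", instance_hypernyms=[town])

chain = [s.label for s in path_to_root(bath_town)]
print(chain)
```

Like the loop above, this follows only the first parent at each step, so it yields one path rather than every path to the root.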
Hasan Zafari