Extract word from a list of synsets in NLTK for Python

Question

Using allnouns = [x for x in wn.all_synsets('n')], I can get a list, allnouns, of all the nouns in WordNet with the help of NLTK.

The list allnouns looks like this: Synset('pile.n.01'), Synset('compost_heap.n.01'), Synset('mass.n.03') and so on. I can get any element by index, for example allnouns[2], which should be Synset('mass.n.03').

I would like to extract only the word mass, but for some reason I cannot treat the element like a string. Everything I try gives an AttributeError: 'Synset' object has no attribute, or a TypeError: 'Synset' object is not subscriptable, or, if I try to use .name or .pos, <bound method Synset.name of Synset('mass.n.03')>.

kmario23 · Accepted Answer · 2018-01-18T01:16:51.920

5

How about trying this solution:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('mass.n.03').name().split(".")[0]
'mass'

For your case:

>>> allnouns = [x for x in wn.all_synsets('n')]

The item at index 23 is Synset('substance.n.07'). You can extract the word from its name like this:

>>> allnouns[23].name().split(".")[0]
'substance'

If you want only the word part of the name for every noun synset, collected in a list, then use:

>>> [x.name().split(".")[0] for x in wn.all_synsets('n')]

This should give exactly the result you need.

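Note: In NLTK's WordNet interface (as of NLTK 3), name is a method rather than an attribute, so it has to be called as name(). Referencing .name without the parentheses is what produces the <bound method Synset.name of Synset('mass.n.03')> output.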
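To illustrate the note above, here is a minimal sketch of the difference between referencing the method and calling it. `StandInSynset` is a hypothetical stand-in (not part of NLTK) that mimics only the `name()` method of a synset, so the snippet runs without the WordNet corpus; `head_word` is likewise an illustrative helper name. It splits from the right with `rsplit`, on the assumption that a word could itself contain a period.

```python
class StandInSynset:
    """Hypothetical stand-in mimicking only the name() method of a synset."""

    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name

    def __repr__(self):
        return "Synset('%s')" % self._name


def head_word(synset):
    """Return the word part of a synset name such as 'mass.n.03'."""
    # The name has the form <word>.<pos>.<sense number>; splitting from the
    # right keeps the word intact even if it contains a period itself.
    word, _pos, _sense = synset.name().rsplit(".", 2)
    return word


s = StandInSynset("mass.n.03")
print(callable(s.name))   # True: s.name is a bound method, not a string
print(s.name())           # mass.n.03
print(head_word(s))       # mass
```

With real data, `head_word(allnouns[2])` should behave the same way, since the helper relies only on `name()`.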

edited Jan 18 '18 at 01:16

answered Jun 12 '15 at 22:13

kmario23


Well, I am not trying to get a single word. Like I said, I will gather all nouns in a list with [x for x in wn.all_synsets('n')] and then from that list I pick one element, let's say allnouns[2], that happens to be Synset('mass.n.03'), and now I want to extract the string "mass". I've seen that code of yours earlier somewhere but that searches for a particular word - not my case – faceoff Jun 12 '15 at 22:18
One of us does it wrong as I don't think we understand each other :) I am sure your code runs just fine but you're missing the point or I fail at explaining it. Where did you get "mass.n.03" from? – faceoff Jun 12 '15 at 22:36
I have a list of elements containing all nouns generated with the above code - now my list is as follows: allnouns = Synset('pile.n.01'), Synset('compost_heap.n.01'), Synset('mass.n.03'), Synset('dunghill.n.02'), Synset('logjam.n.02'), Synset('shock.n.08') - how can I get only the words like "compost" or "mass" or "shock" from that list? I have done x = allnouns[2] and printing it will show "Synset('mass.n.03')" - that's great, now I want the string "mass" extracted – faceoff Jun 12 '15 at 22:43
That's just a use case! After you make 'allnouns', do you want to extract all the words and put them in a list again? – kmario23 Jun 12 '15 at 22:43
I have allnouns = Synset('pile.n.01'), Synset('compost_heap.n.01'), Synset('mass.n.03'), Synset('dunghill.n.02'), Synset('logjam.n.02'), Synset('shock.n.08') and so on 82115 items in it now. I pick a random number "rnd" 0 to 82115 and get allnouns[rnd] and I might get "Synset('mass.n.02')" Now, I want to extract "mass" in a string variable – faceoff Jun 12 '15 at 22:45

It works! My bad, you're right - Just run the damn code! :) This worked: [x for x in wn.all_synsets('n')][2].name().split(".")[0] – faceoff Jun 12 '15 at 22:56
I still don't get this part: wn.synset('mass.n.03').name().split(".")[0] because the only way I am referring to the mass.n.03 is by randomly selecting it from a list as you have done when using [2] – faceoff Jun 12 '15 at 22:59

Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/80442/discussion-between-faceoff-and-mario23). – faceoff Jun 12 '15 at 23:02

score 0 · Answer 2 · edited May 23 '17 at 11:58

Use Synset.name() to get the name of the synset; the part before the first dot is its canonical lemma name:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('mass', 'n')
[Synset('mass.n.01'), Synset('batch.n.02'), Synset('mass.n.03'), Synset('mass.n.04'), Synset('mass.n.05'), Synset('multitude.n.03'), Synset('bulk.n.02'), Synset('mass.n.08'), Synset('mass.n.09')]
>>> wn.synsets('mass', 'n')[0]
Synset('mass.n.01')
>>> wn.synsets('mass', 'n')[0].name()
u'mass.n.01'
>>> wn.synsets('mass', 'n')[0].name().split('.')[0]
u'mass'

But do note that sometimes a synset is made up of several lemmas, so you should use Synset.lemma_names() to access all lemmas if you're using the surface word form of a synset:

>>> wn.synsets('mass', 'n')[0].lemmas()
[Lemma('mass.n.01.mass')]
>>> wn.synsets('mass', 'n')[0].lemma_names()
[u'mass']
>>> wn.synsets('mass', 'n')[0].definition()
u'the property of a body that causes it to have weight in a gravitational field'

In the wn.synsets('mass', 'n')[0] case, there is only one lemma attached to the synset. But sometimes there is more than one, e.g.

>>> wn.synsets('mass', 'n')[1].lemma_names()
[u'batch', u'deal', u'flock', u'good_deal', u'great_deal', u'hatful', u'heap', u'lot', u'mass', u'mess', u'mickle', u'mint', u'mountain', u'muckle', u'passel', u'peck', u'pile', u'plenty', u'pot', u'quite_a_little', u'raft', u'sight', u'slew', u'spate', u'stack', u'tidy_sum', u'wad']
>>> wn.synsets('mass', 'n')[1].definition()
u"(often followed by `of') a large number or amount or extent"
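Multi-word lemma names such as 'good_deal' or 'tidy_sum' use underscores in place of spaces. If readable text is wanted, a small conversion step may help. The following is only a sketch: `surface_form` is a hypothetical helper name, not an NLTK function, and the sample names are copied from the lemma_names() output above.

```python
def surface_form(lemma_name):
    """Turn a WordNet-style lemma name into plain text."""
    # The words of a multi-word lemma name are joined with underscores.
    return lemma_name.replace("_", " ")


# A few lemma names taken from the lemma_names() output shown above.
sample = ["batch", "good_deal", "quite_a_little", "tidy_sum"]
print([surface_form(name) for name in sample])
# ['batch', 'good deal', 'quite a little', 'tidy sum']
```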

And to extract the full list of noun words in WordNet, you can try:

>>> from itertools import chain
>>> set(chain(*[i.lemma_names() for i in wn.all_synsets('n')]))
>>> len(set(chain(*[i.lemma_names() for i in wn.all_synsets('n')])))
119034

See Making a flat list out of list of lists in Python
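As a sketch of the flattening step on its own: `chain.from_iterable` does the same job as `chain(*...)` without first unpacking the whole list into arguments. The nested list below is a small hand-made sample standing in for the result of calling `lemma_names()` on each synset; it is not real WordNet output.

```python
from itertools import chain

# Hand-made stand-in for [s.lemma_names() for s in wn.all_synsets('n')].
nested = [["mass"], ["batch", "deal", "mass"], ["pile", "heap"]]

# Flatten lazily and drop duplicates ('mass' appears twice).
unique_words = set(chain.from_iterable(nested))

print(sorted(unique_words))
# ['batch', 'deal', 'heap', 'mass', 'pile']
print(len(unique_words))
# 5
```

With NLTK, a generator such as `(s.lemma_names() for s in wn.all_synsets('n'))` could be passed to `chain.from_iterable` in the same way.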
