
I'm trying to get the synsets of words in order to build their similarity matrix. However, one of the words is "and". I realized that it is a stopword in nltk and thus may not have a synset. For example,

from nltk.corpus import wordnet as wn
wn.synsets('and')

simply returns [].

Is there a way to get a synset for stopwords, e.g. Synset('and'), so that I can compute the path similarity between 'and' and another word?

Zhanwen Chen

1 Answer


It is not missing because it is a stopword in nltk; it returns [] because it is not in WordNet. There are two reasons for that:

  1. Conjunctions are not in WordNet. It has nouns, verbs and adjectives (and some adverbs), whereas conjunctions practically never have synonyms or hypernyms.

  2. WordNet is not comprehensive, even for nouns. It seems that none of the boolean operators (AND, OR, XOR, NOT) are in there ("or" and "not" have entries, but not in the boolean-operator sense).

If you want to stick with path similarity, you could hand-encode the entries you feel are missing. E.g. write a wrapper that first looks up your words in your own hand-built table and, if they are not there, falls back to WordNet.
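A minimal sketch of such a wrapper, under some assumptions: the `HAND_CODED` table, its scores, and the function names `similarity` and `wordnet_path_similarity` are all hypothetical and chosen for illustration. Returning 1.0 for identical words is also an assumption, made so the diagonal of a similarity matrix is filled even for words WordNet lacks. The fallback takes the best path similarity over all synset pairs, which is one common choice rather than the only one.

```python
from typing import Callable, Dict, FrozenSet, Optional

# Hypothetical hand-encoded scores for word pairs WordNet cannot handle.
# The numbers are made up for illustration; pick values that suit your task.
HAND_CODED: Dict[FrozenSet[str], float] = {
    frozenset(("and", "or")): 0.5,
    frozenset(("and", "but")): 0.33,
}


def wordnet_path_similarity(w1: str, w2: str) -> Optional[float]:
    """Best path similarity over all synset pairs, or None if there is none."""
    # Imported lazily so the hand-coded part works without nltk installed.
    from nltk.corpus import wordnet as wn

    scores = [
        s1.path_similarity(s2)
        for s1 in wn.synsets(w1)
        for s2 in wn.synsets(w2)
    ]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None


def similarity(
    w1: str,
    w2: str,
    fallback: Callable[[str, str], Optional[float]] = wordnet_path_similarity,
) -> Optional[float]:
    """Check the hand-coded table first, then fall back to WordNet."""
    if w1 == w2:
        return 1.0  # assumption: a word is maximally similar to itself
    hand = HAND_CODED.get(frozenset((w1, w2)))
    if hand is not None:
        return hand
    return fallback(w1, w2)
```

With this, `similarity('and', 'or')` comes from the table, while `similarity('cat', 'dog')` is delegated to WordNet (which needs the nltk WordNet corpus to be downloaded). Pairs found in neither place yield None, so the caller has to decide how to fill those cells of the matrix.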

Another approach is to use something like word2vec. That will encode "and" (assuming you do not filter out stopwords before training). You ought to end up with the vectors for "and" and "or" being quite close.
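A rough sketch of the word2vec route, assuming the gensim library (version 4 or later) is the implementation used; the answer does not name one. The helper names `tokenize` and `train_word2vec` and the hyperparameters are hypothetical. The point it illustrates is that the tokenizer keeps stopwords, so "and" stays in the vocabulary.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens, deliberately keeping stopwords."""
    return re.findall(r"[a-z']+", text.lower())


def train_word2vec(sentences: List[List[str]]):
    """Train a small Word2Vec model on pre-tokenized sentences."""
    # Imported lazily; requires gensim >= 4 (an assumption of this sketch).
    from gensim.models import Word2Vec

    # min_count=1 keeps every token, including rare ones, in the vocabulary.
    return Word2Vec(
        sentences, vector_size=50, window=5, min_count=1, workers=1, seed=1
    )
```

After training on a reasonably large corpus, `model.wv.similarity('and', 'or')` gives the cosine similarity between the two vectors. On a toy corpus the number is not meaningful.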

Darren Cook