Can someone tell me the difference between corpora, a corpus, and a lexicon in NLTK?

What is the movie data set?

What is WordNet?

Kumar
  • It is preferred if you can post separate questions instead of combining your questions into one. That way, it helps the people answering your question and also others hunting for at least one of your questions. Thanks! – Rohit Gupta Jul 20 '15 at 21:29
  • Hey Rohit, thanks for the comment... I added them together because they are all related... answering one in the context of the others would help, I believe... – Kumar Jul 20 '15 at 21:31
  • It's not `machine-learning` per se but it's more NLTK and nlp. – alvas Jul 20 '15 at 21:46

1 Answer

Corpora is the plural of corpus.

Corpus basically means a body, and in the context of Natural Language Processing (NLP), it means a body of text.

(source: https://www.google.com.sg/search?q=corpora)


Lexicon is a vocabulary, a list of words, a dictionary (source: https://www.google.com.sg/search?q=lexicon)

In NLTK, any lexicon is treated as a corpus, since a list of words is also a body of text. For example, a list of stopwords is available through the NLTK corpus API:

>>> from nltk.corpus import stopwords
>>> print(stopwords.words('english'))
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
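To make the corpus/lexicon distinction concrete, here is a minimal pure-Python sketch. The three-sentence corpus, the `toy_stopwords` set, and the helper names `tokenize` and `build_lexicon` are all made up for illustration and are not part of NLTK; with NLTK installed you would more likely take the stopwords from `stopwords.words('english')`.

```python
import re

# Toy corpus: a "body of text", here three short made-up documents.
corpus = [
    "The dog chased the cat.",
    "The cat sat on the mat.",
    "A dog is not a cat.",
]

# Tiny hand-written stopword list (an assumption for this sketch, not NLTK's list).
toy_stopwords = {"the", "a", "on", "is", "not"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_lexicon(documents):
    """Return the sorted list of distinct words that occur in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


# The lexicon is the vocabulary derived from the corpus.
lexicon = build_lexicon(corpus)

# Filtering against the stopword list leaves only the content words.
content_words = [word for word in lexicon if word not in toy_stopwords]

print(lexicon)
print(content_words)
```

The corpus keeps every sentence in full, while the lexicon keeps each distinct word only once; the stopword list is itself a small lexicon used as a filter.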

The movie review dataset in NLTK (canonically known as the Movie Reviews Corpus) is a text dataset of 2,000 movie reviews, each labelled with its sentiment polarity, positive or negative (source: http://www.nltk.org/book/ch02.html).

It is often used in tutorials that introduce NLP and sentiment analysis; see http://www.nltk.org/book/ch06.html and the related question "nltk NaiveBayesClassifier training for sentiment analysis".
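As a rough sketch of how such a categorized corpus is usually turned into labelled training data, the snippet below pairs each document's tokens with its category. `ToyCategorizedCorpus` and its three tiny "reviews" are hypothetical stand-ins so the example runs without NLTK data; the helper `label_documents` only assumes an object with `categories()`, `fileids(category)` and `words(fileid)` methods, which is the shape of the interface used with `nltk.corpus.movie_reviews` in chapter 6 of the NLTK book.

```python
class ToyCategorizedCorpus:
    """Hypothetical stand-in exposing categories(), fileids() and words()."""

    def __init__(self, files):
        # files maps a file id such as "pos/r1.txt" to (category, list of tokens).
        self._files = files

    def categories(self):
        """Return the sorted list of distinct category labels."""
        return sorted({category for category, _ in self._files.values()})

    def fileids(self, category):
        """Return the sorted file ids that belong to the given category."""
        return sorted(fid for fid, (cat, _) in self._files.items() if cat == category)

    def words(self, fileid):
        """Return the tokens of one file."""
        return list(self._files[fileid][1])


def label_documents(corpus):
    """Pair each document's token list with its category label."""
    return [
        (list(corpus.words(fileid)), category)
        for category in corpus.categories()
        for fileid in corpus.fileids(category)
    ]


# Made-up miniature data, not taken from the real Movie Reviews Corpus.
toy_reviews = ToyCategorizedCorpus({
    "neg/r1.txt": ("neg", ["a", "dull", "plot"]),
    "pos/r1.txt": ("pos", ["a", "great", "film"]),
    "pos/r2.txt": ("pos", ["funny", "and", "moving"]),
})

documents = label_documents(toy_reviews)
for tokens, label in documents:
    print(label, tokens)
```

Assuming NLTK and the `movie_reviews` data are installed, passing `nltk.corpus.movie_reviews` to `label_documents` instead of the toy object should yield one `(tokens, label)` pair per review.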


WordNet is a lexical database for the English language (it's like a lexicon/dictionary with word-to-word relations) (source: https://wordnet.princeton.edu/).

NLTK's WordNet interface also incorporates the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), which lets you query words in other languages.

Since it is also a list of words (in this case with many other things included: relations, lemmas, POS, etc.), it is also accessed through nltk.corpus in NLTK.

The canonical idiom for using WordNet in NLTK is:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

The easiest way to learn the NLP jargon and the basics is to work through the tutorials in the NLTK book: http://www.nltk.org/book/

alvas