Difference between the total number of words (length of a list) and vocabulary of a list or file in NLP?

Question

How do I compute the total number of words and the vocabulary size of a corpus stored as a list in Python? What is the main difference between these two terms?

Suppose I am using the following list. The total number of words, i.e. the length of the list, can be computed with len(L1). However, I would like to know how to calculate the vocabulary of the list below.

L1 = ['newnes', 'imprint', 'elsevier', 'elsevier', 'corporate', 'drive', 'suite',
      'burlington', 'usa', 'linacre', 'jordan', 'hill', 'oxford', 'uk',
      'elsevier', 'inc', 'right', 'reserved', 'exception', 'newness', 'uk', 'military',
      'organization', 'summary', 'task', 'definition', 'system', 'definition',
      'system', 'engineering', 'military', 'project', 'military', 'project',
      'definition', 'input', 'output', 'operation', 'requirement', 'development',
      'overview', 'spacecraft', 'development', 'architecture', 'design']

Is it the total number of unique words in a list? I don't know exactly, but I guess it is something like that. Is that true? — M S, Sep 25 '18 at 14:10
Ah, I see. Then I think you already know how to do `len(L1)` for the number of words. And I think vocabulary must refer to some kind of relationship between the words? https://pypi.org/project/Vocabulary/ But there can be many different kinds of relationships, so you'd have to know which kind you want in order to count them. — sniperd, Sep 25 '18 at 14:17
See https://stackoverflow.com/questions/51943811/does-the-lemmatization-mechanism-reduce-the-size-of-the-corpus/51978364#51978364 and https://stackoverflow.com/questions/52393591/nltk-lemmatizer-extract-meaningful-words/52396249#52396249 — alvas, Sep 26 '18 at 14:42

Shawn Lee · Answer 1 · 2018-09-27T07:59:11.760

If you are asking how to get the number of unique words in a list, you can use a set. (As far as I remember from NLP, the vocabulary of a corpus is the collection of unique words in that corpus, whereas the total number of words counts every occurrence, repeats included.)

Convert your list to a set with the built-in set() constructor, then call len() on the result. In your case, you would get the number of unique words in the list L1 like so:

len(set(L1))  # number of unique words in L1
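
To make the difference between the two numbers concrete, here is a minimal sketch that uses only the standard library and the list from the question. The names `total_words`, `vocabulary`, `vocabulary_size` and `counts` are just illustrative choices, not anything from the original post.

```python
from collections import Counter

# The list from the question, copied as-is.
L1 = ['newnes', 'imprint', 'elsevier', 'elsevier', 'corporate', 'drive', 'suite',
      'burlington', 'usa', 'linacre', 'jordan', 'hill', 'oxford', 'uk',
      'elsevier', 'inc', 'right', 'reserved', 'exception', 'newness', 'uk', 'military',
      'organization', 'summary', 'task', 'definition', 'system', 'definition',
      'system', 'engineering', 'military', 'project', 'military', 'project',
      'definition', 'input', 'output', 'operation', 'requirement', 'development',
      'overview', 'spacecraft', 'development', 'architecture', 'design']

total_words = len(L1)              # every occurrence counts, repeats included
vocabulary = set(L1)               # each distinct word is kept once
vocabulary_size = len(vocabulary)
counts = Counter(L1)               # occurrences per distinct word

print("Total words:", total_words)
print("Vocabulary size:", vocabulary_size)
print("Most frequent words:", counts.most_common(3))
```

For this list that gives 45 words in total and a vocabulary of 35, because words such as 'elsevier' and 'military' occur more than once. Note that 'newnes' and 'newness' still count as two separate entries, since a set only merges exact string matches.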

Edit: You have now mentioned that the vocabulary is the set of lemmatized words. In that case the approach is the same, with one extra step: import a lemmatizer from NLTK (or whichever NLP library you are using), run each word in your list through it, convert the output to a set, and take len() of that set as above.

Note that this might not be 100% the answer OP is looking for when it comes to (distinct) word stems in the list of words. In that case you would also have to do some stemming first, i.e. `len(set(stemmed_L1))`. — dennlinger, Sep 26 '18 at 06:20
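
As a rough sketch of the point in this comment, the vocabulary size depends on how the words are normalized before they go into the set. The helper names `vocabulary` and `strip_plural_s` below are hypothetical, and `strip_plural_s` is only a toy stand-in for a real stemmer or lemmatizer (it is not a library function), so treat it as an illustration of the pattern rather than something to use on real text.

```python
def vocabulary(tokens, normalize=None):
    """Return the set of distinct words, optionally after normalizing each one."""
    if normalize is None:
        return set(tokens)
    return {normalize(token) for token in tokens}


def strip_plural_s(token):
    # Toy stand-in for a stemmer (assumption, for illustration only):
    # drop one trailing "s" from words longer than three characters.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


tokens = ["cat", "dog", "cats", "children", "dog"]

raw_vocab = vocabulary(tokens)                                # 'cat' and 'cats' stay distinct
stemmed_vocab = vocabulary(tokens, normalize=strip_plural_s)  # 'cats' collapses into 'cat'

print(len(tokens), len(raw_vocab), len(stemmed_vocab))
```

Swapping `strip_plural_s` for a real stemmer's or lemmatizer's function would give the `len(set(stemmed_L1))` count mentioned above; the toy version does not handle irregular forms such as 'children'.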

pajamas · Accepted Answer · 2018-09-26T10:43:41.537

Is this what you're looking for?

from nltk.stem import WordNetLemmatizer  # needs the WordNet corpus, e.g. nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
list_of_tokens = ['cat', 'dog', 'cats', 'children', 'dog']
unique_tokens = set(list_of_tokens)
# {'cat', 'cats', 'children', 'dog'}  (set order is arbitrary)

tokens_lemmatized = [lemmatizer.lemmatize(token) for token in unique_tokens]
# e.g. ['child', 'cat', 'cat', 'dog']  (order follows the set's iteration order)

unique_tokens_lemmatized = set(tokens_lemmatized)
# {'cat', 'child', 'dog'}

print('Input tokens:', len(list_of_tokens), 'Lemmatized tokens:', len(unique_tokens_lemmatized))
# Input tokens: 5 Lemmatized tokens: 3
