
I have encoded a text data set using the scikit-learn CountVectorizer, e.g.:

c_vec = CountVectorizer(stop_words=stopwords)

where the stop words were generated by nltk.

I used output = c_vec.fit_transform(data) to encode my dataset. I then wanted to check what the encoder was doing, so I ran print(output) and got a printout that looks like:

  (0, 3744) 3
  (0, 4511) 2
  (0, 4071) 2
  (0, 1831) 1
  (0, 4321) 2
  (0, 8156) 2
  (0, 7982) 1
  (0, 2714) 1
  (0, 2505) 1
  ...
  (2394, 6070)  1
  (2394, 8559)  2
  (2394, 8087)  1
  (2394, 7997)  8
  (2394, 7827)  1
  (2394, 5159)  5
  (2394, 5396)  1 

My understanding is that with (0, 3744) 3,

  • 0 is the line number of the string from the dataset
  • 3744 is the encoding of the word
  • 3 is the count of that word in the string.

However, I want to be able to see what word is associated with 3744. I have read the scikit-learn documentation but can't find what I am looking for. Any suggestions?

GalacticPonderer

1 Answer


TL;DR: c_vec.get_feature_names()[3744] will do your job; read below for the details.


Your starting point is the .vocabulary_ attribute [see EDIT at the end for a more straightforward way], which, according to the documentation, provides a dictionary with

A mapping of terms to feature indices.

Adapting the example from the documentation:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
         ]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

we get

print(X)

  (0, 8)    1
  (0, 3)    1
  (0, 6)    1
  (0, 2)    1
  (0, 1)    1
  (1, 8)    1
  (1, 3)    1
  (1, 6)    1
  (1, 1)    2
  (1, 5)    1
  (2, 8)    1
  (2, 3)    1
  (2, 6)    1
  (2, 0)    1
  (2, 7)    1
  (2, 4)    1
  (3, 8)    1
  (3, 3)    1
  (3, 6)    1
  (3, 2)    1
  (3, 1)    1

and

vectorizer.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}

So, your problem has now become how to find the dictionary key with a given value.
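As an aside (not part of the original answer): if you need many such lookups, it can be cheaper to invert the vocabulary dictionary once with a dict comprehension, so each subsequent lookup is O(1). A minimal sketch, using the same toy corpus as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# invert once: feature index -> term; each lookup afterwards is O(1)
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

index_to_term[8]
# 'this'
```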

Modifying method3 from this SO answer (since it does not seem to work with Python 3):

def get_term(vocab, search_index):
    # return the (first) key whose value equals search_index
    return list(vocab.keys())[list(vocab.values()).index(search_index)]

we get:

get_term(vectorizer.vocabulary_, 8)
# 'this'

get_term(vectorizer.vocabulary_, 5)
# 'second'

i.e. exactly what you are after.

Notice that the get_term() function will return only the first key with the given value. In the specific case here, where the dictionary is a vocabulary, that is not an issue: the values (feature indices) are unique by definition.
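That uniqueness claim can be confirmed programmatically rather than by eye; a quick sketch, again using the toy corpus from above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vocab = CountVectorizer().fit(corpus).vocabulary_

# feature indices are assigned one per term, so the values must be unique
assert len(set(vocab.values())) == len(vocab)
```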

Notice also that, although there are some alternatives to method3 in the SO answer linked above, that method is by far the fastest for large dictionaries (as is typically the case in real-world NLP corpora).

EDIT

As correctly suggested by Ben Reiniger in the comments below, a more straightforward way to get the vocabulary term that corresponds to the k-th column of the document-term matrix X is the k-th element of get_feature_names():

names_ = vectorizer.get_feature_names() # run it once, as it is costly for large vocabularies
names_[8]
# 'this'

names_[5]
# 'second'
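A version note (not part of the original answer): get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, in favour of get_feature_names_out(), which returns a NumPy array of terms. A version-tolerant sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer().fit(corpus)

# scikit-learn >= 1.0 provides get_feature_names_out(); older versions
# only have get_feature_names(), so fall back accordingly
if hasattr(vectorizer, "get_feature_names_out"):
    names_ = vectorizer.get_feature_names_out()  # numpy array of terms
else:
    names_ = vectorizer.get_feature_names()      # plain Python list

names_[8]
# 'this'
```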
desertnaut
    `get_feature_names` already inverts the `vocabulary_` internally, although it maybe doesn't use the most efficient method?... – Ben Reiniger Apr 01 '21 at 14:22
  • @BenReiniger seems correct, too, and certainly more straightforward; didn't check efficiencies, but I would be surprised if it is *less* efficient. You'll post an answer, or I'll update mine to include this possibility? – desertnaut Apr 01 '21 at 14:26