TL;DR: c_vec.get_feature_names()[3744] will do your job; read below for the details.
Your starting point is the .vocabulary_ attribute [see EDIT at the end for a more straightforward way], which, according to the documentation, provides a dictionary that is "a mapping of terms to feature indices".
Adapting the example from the documentation:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
we get
print(X)
(0, 8) 1
(0, 3) 1
(0, 6) 1
(0, 2) 1
(0, 1) 1
(1, 8) 1
(1, 3) 1
(1, 6) 1
(1, 1) 2
(1, 5) 1
(2, 8) 1
(2, 3) 1
(2, 6) 1
(2, 0) 1
(2, 7) 1
(2, 4) 1
(3, 8) 1
(3, 3) 1
(3, 6) 1
(3, 2) 1
(3, 1) 1
and
vectorizer.vocabulary_
{'and': 0,
'document': 1,
'first': 2,
'is': 3,
'one': 4,
'second': 5,
'the': 6,
'third': 7,
'this': 8}
So, now your problem has become how to find the dictionary key with a given value. Modifying method3 from this SO answer (since it does not seem to work with Python 3):
def get_term(vocabulary, search_index):
    # Return the first term (key) whose feature index (value) equals search_index
    return list(vocabulary.keys())[list(vocabulary.values()).index(search_index)]
we get:
get_term(vectorizer.vocabulary_, 8)
# 'this'
get_term(vectorizer.vocabulary_, 5)
# 'second'
i.e. exactly what you are after.
Notice that the get_term() function will return only the first key with the given value; nevertheless, in the specific case here, where the dictionary is a vocabulary, this is not an issue, since by definition the values are unique, as can easily be confirmed by simple inspection.
Notice also that, although there are some alternatives to method3 in the SO answer linked above, the said method is by far the fastest when it comes to big dictionaries (as is the case with such NLP applications in real-world corpora).
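If you need to look up many indices rather than just one, an alternative sketch (not taken from the linked SO answer) is to invert the dictionary once and reuse it. This relies on the values being unique, as noted above. The vocabulary literal below is copied from the output shown earlier, and index_to_term is just a name made up for illustration:

```python
# Vocabulary copied from the vectorizer.vocabulary_ output shown earlier
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'second': 5, 'the': 6, 'third': 7, 'this': 8}

# Sanity check: every index appears exactly once, so inverting loses nothing
assert len(set(vocabulary.values())) == len(vocabulary)

# Build the reverse mapping (feature index -> term) once
index_to_term = {index: term for term, index in vocabulary.items()}

print(index_to_term[8])  # this
print(index_to_term[5])  # second
```

Building the reverse dictionary costs one pass over the vocabulary; after that, each lookup is a plain dictionary access.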
EDIT
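As correctly suggested by Ben Reiniger in the comments below, a more straightforward way to get the vocabulary term that corresponds to the k-th column of the document-term matrix X is to take the k-th element of get_feature_names() (in scikit-learn 1.0 and later this method is called get_feature_names_out(); the old name was removed in version 1.2):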
names_ = vectorizer.get_feature_names() # run it once, as it is costly for large vocabularies
names_[8]
# 'this'
names_[5]
# 'second'
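To see how the two approaches relate, here is a small sketch in plain Python. It assumes, as the outputs above suggest, that the list of feature names is simply the vocabulary terms ordered by their feature index. The vocabulary literal is copied from the earlier output, and names_from_vocab is a hypothetical name used only here:

```python
# Vocabulary copied from the vectorizer.vocabulary_ output shown earlier
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'second': 5, 'the': 6, 'third': 7, 'this': 8}

# Order the terms by their column index, so that position k holds
# the term for column k of the document-term matrix
names_from_vocab = sorted(vocabulary, key=vocabulary.get)

print(names_from_vocab[8])  # this
print(names_from_vocab[5])  # second
```

This works only because the indices run from 0 to len(vocabulary) - 1 without gaps, which is the case for a fitted vocabulary.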