I want to convert a list of words to a list of integers in scikit-learn, for a corpus that consists of a list of lists of words (e.g., a collection of sentences).
I can do it as follows using sklearn.feature_extraction.text.CountVectorizer, but is there a simpler way? I suspect I may be missing some CountVectorizer functionality, since this is a common pre-processing step in natural language processing. In this code I first fit CountVectorizer, then iterate over each word of each list of words to generate the list of integers.
import sklearn
import sklearn.feature_extraction.text
import numpy as np


def reverse_dictionary(mapping):
    '''
    http://stackoverflow.com/questions/483666/python-reverse-inverse-a-mapping
    '''
    return {v: k for k, v in mapping.items()}


vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.']
# Fit the vocabulary (X itself, the count matrix, is not used below)
X = vectorizer.fit_transform(corpus).toarray()

tokenizer = vectorizer.build_tokenizer()
output_corpus = []
for line in corpus:
    # The tokenizer does not lowercase, so do it by hand
    line = tokenizer(line.lower())
    # np.int was removed in NumPy 1.24; the builtin int is equivalent
    output_line = np.empty_like(line, dtype=int)
    for token_number, token in np.ndenumerate(line):
        output_line[token_number] = vectorizer.vocabulary_.get(token)
    output_corpus.append(output_line)
print('output_corpus: {0}'.format(output_corpus))

word2idx = vectorizer.vocabulary_
print('word2idx: {0}'.format(word2idx))

idx2word = reverse_dictionary(word2idx)
print('idx2word: {0}'.format(idx2word))
outputs:
output_corpus: [array([9, 3, 7, 2, 1]),           # 'This is the first document.'
                array([9, 3, 7, 6, 6, 1]),        # 'This is the second second document.'
                array([0, 7, 8, 4]),              # 'And the third one.'
                array([3, 9, 7, 2, 1, 9, 3, 5])]  # 'Is this the first document? This is right.'
word2idx: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4,
           u'second': 6, u'the': 7, u'document': 1, u'first': 2}
idx2word: {0: u'and', 1: u'document', 2: u'first', 3: u'is', 4: u'one', 5: u'right',
           6: u'second', 7: u'the', 8: u'third', 9: u'this'}
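
For comparison, the fit-then-look-up procedure described above can probably be shortened a little. The following is only a sketch, not a confirmed "official" way to do this: it assumes that build_analyzer() (which bundles the vectorizer's lowercasing and tokenization into one callable) is acceptable in place of build_tokenizer() plus a manual lower(), and that tokens missing from the vocabulary should simply be skipped. With the default settings it should yield the same indices as the output listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.']

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# build_analyzer() returns a callable that applies the vectorizer's own
# preprocessing (lowercasing by default) and tokenization to one string.
analyzer = vectorizer.build_analyzer()
word2idx = vectorizer.vocabulary_

# Look up each token's index; skipping unknown tokens is an assumption of
# this sketch (nothing is skipped here, since the same corpus was fitted).
output_corpus = [[int(word2idx[token]) for token in analyzer(line) if token in word2idx]
                 for line in corpus]

# Inverse mapping, from index back to word.
idx2word = {idx: word for word, idx in word2idx.items()}

print('output_corpus: {0}'.format(output_corpus))
```

This still loops over every token, just as a list comprehension rather than an explicit nested loop, so it is a tidier spelling of the same idea rather than a different method.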