
I want to convert a list of words to a list of integers in scikit-learn, and do so for a corpus that consists of a list of lists of words (e.g. the corpus can be a list of sentences).

I can do it as follows with sklearn.feature_extraction.text.CountVectorizer, but is there a simpler way? I suspect I may be missing some CountVectorizer functionality, as this is a common pre-processing step in natural language processing. In this code I first fit the CountVectorizer, then I have to iterate over each word of each list of words to generate the list of integers.

import sklearn.feature_extraction.text
import numpy as np

def reverse_dictionary(mapping):
    '''
    Invert a word -> index mapping into an index -> word mapping.
    http://stackoverflow.com/questions/483666/python-reverse-inverse-a-mapping
    '''
    return {v: k for k, v in mapping.items()}

vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)

corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

X = vectorizer.fit_transform(corpus).toarray()

tokenizer = vectorizer.build_tokenizer()
output_corpus = []
for line in corpus:
    # tokenize the lowercased sentence, then look up each token's integer id
    line = tokenizer(line.lower())
    output_line = np.empty_like(line, dtype=int)  # np.int is deprecated, use the builtin int
    for token_number, token in np.ndenumerate(line):
        output_line[token_number] = vectorizer.vocabulary_.get(token)
    output_corpus.append(output_line)
print('output_corpus: {0}'.format(output_corpus))

word2idx = vectorizer.vocabulary_
print('word2idx: {0}'.format(word2idx))

idx2word = reverse_dictionary(word2idx)
print('idx2word: {0}'.format(idx2word))

outputs:

output_corpus: [array([9, 3, 7, 2, 1]), # 'This is the first document.'
                array([9, 3, 7, 6, 6, 1]), # 'This is the second second document.'
                array([0, 7, 8, 4]), # 'And the third one.'
                array([3, 9, 7, 2, 1, 9, 3, 5])] # 'Is this the first document? This is right.'
word2idx: {u'and': 0, u'right': 5, u'third': 8, u'this': 9, u'is': 3, u'one': 4,
           u'second': 6, u'the': 7, u'document': 1, u'first': 2}
idx2word: {0: u'and', 1: u'document', 2: u'first', 3: u'is', 4: u'one', 5: u'right', 
           6: u'second', 7: u'the', 8: u'third', 9: u'this'}
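
As a quick sanity check (a small sketch, not part of the snippet above), the integer lists can be mapped back to tokens with idx2word:

# Sketch: map the integer ids back to tokens with idx2word
print([[idx2word[i] for i in line] for line in output_corpus])
# expected first entry: ['this', 'is', 'the', 'first', 'document']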
– Franck Dernoncourt

3 Answers


I don't know if there is a more direct way, but you can simplify the syntax by using map instead of a for loop to iterate over each word.

You can also use build_analyzer(), which handles both preprocessing and tokenization, so there is no need to call lower() explicitly.

analyzer = vectorizer.build_analyzer()
output_corpus = [map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line)) for line in corpus]
# For Python 3.x it should be
# [list(map(lambda x: vectorizer.vocabulary_.get(x), analyzer(line))) for line in corpus]

output_corpus:

[[9, 3, 7, 2, 1], [9, 3, 7, 6, 6, 1], [0, 7, 8, 4], [3, 9, 7, 2, 1, 9, 3, 5]]

Edit

Thanks to @user3914041, just using a list comprehension might be preferable in this case. It avoids the lambda and can thus be slightly faster than map (according to Python List Comprehension Vs. Map and my simple tests).

output_corpus = [[vectorizer.vocabulary_.get(x) for x in analyzer(line)] for line in corpus]
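
As a small follow-up sketch (the new sentence below is just an illustration, not part of the original answer), the same mapping can be applied to unseen text; tokens that were not in the fitted vocabulary come back as None, since dict.get returns None for missing keys:

# Sketch: apply the fitted vocabulary to a new, hypothetical sentence
new_line = 'This is another document.'
print([vectorizer.vocabulary_.get(x) for x in analyzer(new_line)])
# expected: [9, 3, None, 1]  ('another' is not in the vocabulary)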
– yangjie

    I don't think it gets much better than this; I'm not aware of a way to do this using `CountVectorizer`. I prefer the list comprehension syntax though: `[[vectorizer.vocabulary_.get(x) for x in analyzer(line)] for line in corpus]` – ldirer Aug 25 '15 at 09:01
    @user3914041 Hmm... I agree that a list comprehension is preferable in this case. – yangjie Aug 25 '15 at 09:21

I often use Counter to solve this in Python, e.g.:

from collections import Counter

corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

# join the list of sentences into one string and split it into words
as_one = ' '.join(corpus)
words = as_one.split()

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

print(vocab_to_int)

output:

{'the': 1, 'This': 2, 'is': 3, 'first': 4, 'document.': 5, 'second': 6, 'And': 7, 'third': 8, 'one.': 9, 'Is': 10, 'this': 11, 'document?': 12, 'right.': 13}
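
As a rough follow-up sketch (not part of the original answer, and using the same naive split() as above, so 'document.' and 'document?' keep different ids), each sentence can then be mapped through vocab_to_int to get the lists of integers the question asks for:

# Sketch: convert each sentence into a list of integer ids via vocab_to_int
corpus_as_ints = [[vocab_to_int[word] for word in sentence.split()]
                  for sentence in corpus]
print(corpus_as_ints)
# expected, given the vocabulary above:
# [[2, 3, 1, 4, 5], [2, 3, 1, 6, 6, 5], [7, 1, 8, 9], [10, 11, 1, 4, 12, 2, 3, 13]]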


For a given text, CountVectorizer is meant to return a vector containing the count of each word.

E.g. for the corpus corpus = ["the cat", "the dog"], the vectorizer will find 3 different words, so it will output vectors of dimension 3, one dimension per word. scikit-learn orders the vocabulary alphabetically, so "cat" corresponds to the first dimension, "dog" to the second one, and "the" to the third one. For instance, "the cat" would be transformed to [1, 0, 1], "the dog" to [0, 1, 1], and sentences with repeated words would have larger values (e.g. "the cat cat" → [2, 0, 1]).
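
For instance, a minimal sketch of what this looks like (the variable name v is just illustrative):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer()
>>> v.fit_transform(["the cat", "the dog"]).toarray()
array([[1, 0, 1],
       [0, 1, 1]])
>>> sorted(v.vocabulary_.items())  # columns are ordered alphabetically
[('cat', 0), ('dog', 1), ('the', 2)]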

For what you want to do, you would have a good time with the Zeugma package. You would just have to do the following (after running pip install zeugma in a terminal):

>>> from zeugma import TextsToSequences
>>> sequencer = TextsToSequences()
>>> sequencer.fit_transform(["this is a sentence.", "and another one."])
array([[1, 2, 3, 4], [5, 6, 7]], dtype=object)

And you can always access the index-to-word mapping with:

>>> sequencer.index_word
{1: 'this', 2: 'is', 3: 'a', 4: 'sentence', 5: 'and', 6: 'another', 7: 'one'}

From there you can transform any new sentence with this mapping:

>>> sequencer.transform(["a sentence"])
array([[3, 4]])

I hope it helps!

– Wajsbrot