I have the following code:

    train_set = ("The sky is blue.", "The sun is bright.")
    test_set = ("The sun in the sky is bright.",
        "We can see the shining sun, the bright sun.")

Now I'm trying to calculate the word frequency like this:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()

Next I would like to print the vocabulary. To do that, I run:

    vectorizer.fit_transform(train_set)
    print(vectorizer.vocabulary)

Right now the output is None, while I expected something like:

    {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

Any thoughts where this goes wrong?

Frits Verstraten
    Possible duplicate of [CountVectorizer does not print vocabulary](http://stackoverflow.com/questions/28894756/countvectorizer-does-not-print-vocabulary) – José Sánchez Jan 17 '17 at 14:20

2 Answers

I think you can try this:

    print(vectorizer.vocabulary_)
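
To make the difference concrete, here is a minimal sketch using the training sentences from the question. The fitted mapping lives in vocabulary_ (with a trailing underscore) and maps each term to a column index, not to a count. The last step, summing the columns of the count matrix to get per-word frequencies, is my own addition and assumes that total counts over the training set are what you are after; the names counts, totals and word_freq are just illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix
counts = vectorizer.fit_transform(train_set)

# vocabulary_ maps each term to its column index in the matrix
vocabulary = vectorizer.vocabulary_
print(vocabulary)

# Assumption: you want total counts over all training documents,
# so sum each column and pair the result with its term
totals = np.asarray(counts.sum(axis=0)).ravel()
word_freq = {term: int(totals[idx]) for term, idx in vocabulary.items()}
print(word_freq)
```

With the default settings the text is lowercased and words such as "the" and "is" are kept, so the vocabulary has six entries rather than the four in the question. Passing stop_words='english' to CountVectorizer should drop those two words.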
José Sánchez

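CountVectorizer's vocabulary_ attribute maps each term to its column index in the count matrix, not to a word frequency, so printing it will not give you counts directly.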

You can use the Counter class:

    from collections import Counter

    train_set = ("The sky is blue.", "The sun is bright.")
    word_counter = Counter()
    for s in train_set:
        word_counter.update(s.split())

    print(word_counter)

Gives

    Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1})

Or you can use FreqDist from nltk:

    from nltk import FreqDist

    train_set = ("The sky is blue.", "The sun is bright.")
    word_dist = FreqDist()
    for s in train_set:
        word_dist.update(s.split())

    print(dict(word_dist))

Gives

    {'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
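
Note that splitting on whitespace keeps punctuation and case, which is why 'blue.' and 'The' show up above. If you would rather count 'the' and 'blue' as plain lowercase words, one possible variation is to tokenize with a regular expression first. This is only a sketch: the \w+ pattern and the lowercasing are my assumptions about what counts as a word here.

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")

word_counter = Counter()
for sentence in train_set:
    # Assumption: a "word" is a run of word characters, compared case-insensitively
    tokens = re.findall(r"\w+", sentence.lower())
    word_counter.update(tokens)

print(dict(word_counter))
```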
Aris F.