
I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features that the vectorizer generated, but instead I get a list of codes, not words. What does this mean, and how do I deal with this problem? Here is my code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()

And I got the following output:

[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',

and so on.

I need real feature names (words), not these codes. Can anybody help me please?

UPDATE: I managed to deal with this problem, but now when I look at my words I see many entries that are not actually words but senseless strings of letters (see the screenshot below). Does anybody know how to filter these out before I use CountVectorizer?

[Screenshot of the data.head() output]

Dmitrij Burlaj

3 Answers


You are using min_df=1, which includes every word that is found in at least one document, i.e. all the words. min_df can itself be treated as a hyperparameter for removing the rarest terms. I would recommend using spaCy to tokenize the words and join them back into strings before giving them as input to CountVectorizer (see the sketch below).

Note: the feature names that you see are actually part of your vocabulary. It's just noise. If you want to remove them, then set min_df > 1.
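
Something like this, for example (an untested sketch of the idea; it assumes spaCy's en_core_web_sm model is installed, and I've added an is_alpha filter, which is my own addition, to drop the numeric tokens shown in your output):

  import spacy
  from sklearn.feature_extraction.text import CountVectorizer

  nlp = spacy.load('en_core_web_sm')

  def keep_alpha_tokens(text):
      # Hypothetical helper: tokenize with spaCy and keep only purely
      # alphabetic tokens, dropping numbers and ids like '00001_copy_1'.
      return ' '.join(tok.text for tok in nlp(text) if tok.is_alpha)

  cleaned = df['message_encoding'].apply(keep_alpha_tokens)
  vectorizer = CountVectorizer(min_df=2, stop_words='english')  # min_df=2 as tried in the comments
  X = vectorizer.fit_transform(cleaned)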

dhanush-ai1990
  • I already changed it to min_df=2, but this did not solve the problem. I am afraid that if I increase this hyperparameter any further, I may lose important words that could be very useful for the later classification of the text. I think I need to preprocess the text in some way instead, but I don't know how. – Dmitrij Burlaj Nov 22 '17 at 09:15
  • Can you try using spaCy to preprocess the text? spaCy is an NLP toolkit which is very good for named entity recognition, dependency parsing, etc. You could tokenize the text using spaCy, select only the noun phrases/words using dependency parsing, and use those as your vocabulary (see the sketch after these comments). – dhanush-ai1990 Nov 24 '17 at 16:33
  • OK, how can I select only the nouns using dependency parsing in spaCy? Do you know any good tutorials on this? – Dmitrij Burlaj Nov 24 '17 at 18:16
  • I have 32-bit Windows 7; is it possible to install spaCy on it? On the spaCy website they say it is compatible with 64-bit Python... – Dmitrij Burlaj Nov 24 '17 at 18:20
  • Or can I do something similar with NLTK? – Dmitrij Burlaj Nov 24 '17 at 19:26
  • I think you can do something similar with NLTK. The below StackOverflow post can help you: https://stackoverflow.com/questions/7443330/how-do-i-do-dependency-parsing-in-nltk – dhanush-ai1990 Nov 27 '17 at 04:51
  • Thanks, I have already parsed out nouns with NLTK, but unfortunately this didn't help get rid of the senseless combinations of letters that CountVectorizer considers words... and this makes my models perform poorly. What else can be done? – Dmitrij Burlaj Nov 29 '17 at 05:06
  • I know I'm late, but I just stumbled onto this. Use the vocabulary_ attribute to get what you want. – Herc01 Jun 05 '19 at 12:39
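
For reference, here is a minimal sketch of the noun-filtering idea from the comments above, using NLTK as the asker did (keep_nouns is a hypothetical helper; it assumes the punkt and averaged_perceptron_tagger resources have been fetched with nltk.download):

  import nltk

  def keep_nouns(text):
      # Tag each token and keep only nouns
      # (Penn Treebank tags NN, NNS, NNP, NNPS all start with 'NN').
      tagged = nltk.pos_tag(nltk.word_tokenize(text))
      return ' '.join(word for word, tag in tagged if tag.startswith('NN'))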

Here is what you can do to get exactly what you want:

  from sklearn.feature_extraction.text import CountVectorizer

  vectorizer = CountVectorizer()
  vectorizer.fit_transform(df['message_encoding'])
  # vocabulary_ maps each term to its column index; its keys are the words
  feat_dict = vectorizer.vocabulary_.keys()
Herc01
  • @Dmitrij Burlaj don't forget to give me a little upvote if that ever works – Herc01 Jun 05 '19 at 12:43
  • Don't forget to use `sorted(vectorizer.vocabulary_.keys())` to preserve the same order as `vectorizer.get_feature_names()`, which returns the feature names already sorted! – Boubacar Traoré Jul 18 '19 at 12:51

Instead of vectorizer.get_feature_names() you can write vectorizer.vocabulary_.keys() to get the words.
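
A quick illustration on a toy corpus (my own example, not from the original post) showing that both calls expose the same feature names, just unsorted in the vocabulary_ case:

  from sklearn.feature_extraction.text import CountVectorizer

  docs = ['good words here', 'more good words here']
  vec = CountVectorizer()
  vec.fit_transform(docs)
  print(sorted(vec.vocabulary_.keys()))  # ['good', 'here', 'more', 'words']
  print(vec.get_feature_names())         # same list, already sorted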