
I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features that the vectorizer generated, but instead I get a list of codes, not words. What does this mean, and how do I deal with this problem? Here is my code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()

And I got the following output:

[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',

and so on.

I need real feature names (words), not these codes. Can anybody help me please?

UPDATE: I managed to deal with this problem, but now when I look at my words I see many entries that are not actually words but senseless strings of letters (see the screenshot below). Does anybody know how to filter these out before I use CountVectorizer?

[Screenshot of the data.head() output]

Dmitrij Burlaj

3 Answers


You are using min_df=1, which includes every word that is found in at least one document, i.e. all the words. min_df can itself be treated as a hyperparameter for removing the rarest terms. I would recommend using spaCy to tokenize the words and join them back into strings before giving them as input to CountVectorizer (see the sketch below).

Note: the feature names that you see are actually part of your vocabulary. It's just noise. If you want to remove them, then set min_df > 1.
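
Something like this, for example (an untested sketch of the idea; it assumes spaCy's en_core_web_sm model is installed, and I've added an is_alpha filter, which is my own addition, to drop the numeric tokens shown in your output):

  import spacy
  from sklearn.feature_extraction.text import CountVectorizer

  nlp = spacy.load('en_core_web_sm')

  def keep_alpha_tokens(text):
      # Hypothetical helper: tokenize with spaCy and keep only purely
      # alphabetic tokens, dropping numbers and ids like '00001_copy_1'.
      return ' '.join(tok.text for tok in nlp(text) if tok.is_alpha)

  cleaned = df['message_encoding'].apply(keep_alpha_tokens)
  vectorizer = CountVectorizer(min_df=2, stop_words='english')  # min_df=2 as tried in the comments
  X = vectorizer.fit_transform(cleaned)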

dhanush-ai1990
  • I already changed it to min_df=2, but this did not solve the problem. I am afraid that if I increase this hyperparameter any further, I may lose important words that could be very useful for the later classification of the text. I think I need to preprocess the text in some way instead, but I don't know how. – Dmitrij Burlaj Nov 22 '17 at 09:15
  • Can you try using spaCy to preprocess the text? spaCy is an NLP toolkit which is very good for named entity recognition, dependency parsing, etc. You could tokenize the text using spaCy, select only the noun phrases/words using dependency parsing, and use those as your vocabulary (see the sketch after these comments). – dhanush-ai1990 Nov 24 '17 at 16:33
  • OK, how can I select only the nouns using dependency parsing in spaCy? Do you know any good tutorials on this? – Dmitrij Burlaj Nov 24 '17 at 18:16
  • I have 32-bit Windows 7; is it possible to install spaCy on it? On the spaCy website they say it is compatible with 64-bit Python... – Dmitrij Burlaj Nov 24 '17 at 18:20
  • Or can I do something similar with NLTK? – Dmitrij Burlaj Nov 24 '17 at 19:26
  • I think you can do something similar with NLTK. The below StackOverflow post can help you: https://stackoverflow.com/questions/7443330/how-do-i-do-dependency-parsing-in-nltk – dhanush-ai1990 Nov 27 '17 at 04:51
  • Thanks, I have already parsed out nouns with NLTK, but unfortunately this didn't help get rid of the senseless combinations of letters that CountVectorizer considers words... and this makes my models perform poorly. What else can be done? – Dmitrij Burlaj Nov 29 '17 at 05:06
  • I know I'm late, but I just stumbled onto this. Use the vocabulary_ attribute to get what you want. – Herc01 Jun 05 '19 at 12:39
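
For reference, here is a minimal sketch of the noun-filtering idea from the comments above, using NLTK as the asker did (keep_nouns is a hypothetical helper; it assumes the punkt and averaged_perceptron_tagger resources have been fetched with nltk.download):

  import nltk

  def keep_nouns(text):
      # Tag each token and keep only nouns
      # (Penn Treebank tags NN, NNS, NNP, NNPS all start with 'NN').
      tagged = nltk.pos_tag(nltk.word_tokenize(text))
      return ' '.join(word for word, tag in tagged if tag.startswith('NN'))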

Here is what you can do to get exactly what you want:

  from sklearn.feature_extraction.text import CountVectorizer

  vectorizer = CountVectorizer()
  vectorizer.fit_transform(df['message_encoding'])
  # vocabulary_ maps each term to its column index; its keys are the words
  feat_dict = vectorizer.vocabulary_.keys()
Herc01
  • @Dmitrij Burlaj don't forget to give me a little upvote if that ever works – Herc01 Jun 05 '19 at 12:43
  • Don't forget to use `sorted(vectorizer.vocabulary_.keys())` to preserve the same order as `vectorizer.get_feature_names()`, which returns the feature names already sorted! – Boubacar Traoré Jul 18 '19 at 12:51

Instead of vectorizer.get_feature_names() you can write vectorizer.vocabulary_.keys() to get the words.
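
A quick illustration on a toy corpus (my own example, not from the original post) showing that both calls expose the same feature names, just unsorted in the vocabulary_ case:

  from sklearn.feature_extraction.text import CountVectorizer

  docs = ['good words here', 'more good words here']
  vec = CountVectorizer()
  vec.fit_transform(docs)
  print(sorted(vec.vocabulary_.keys()))  # ['good', 'here', 'more', 'words']
  print(vec.get_feature_names())         # same list, already sorted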