python sklearn using more than just the count features for naive bayes learning

Question

First of all, I am new to Python and NLP / machine learning. Right now I have the following code:

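from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# `data` and `myTokenizer` are defined elsewhere
vectorizer = CountVectorizer(
    input="content",
    decode_error="ignore",
    strip_accents=None,
    stop_words=stopwords.words('english'),
    tokenizer=myTokenizer
)
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['sentiment'].values
classifier.fit(counts, targets)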

This actually works pretty well. I get a sparse matrix from the CountVectorizer, and the classifier uses that matrix together with the targets (0, 2, 4).

However, what would I have to do if I wanted to use more features in the vector instead of just the word counts? I can't seem to find that out. Thank you in advance.

Possible duplicate of [How to add another feature (length of text) to current bag of words classification? Scikit-learn](https://stackoverflow.com/questions/39121104/how-to-add-another-feature-length-of-text-to-current-bag-of-words-classificati) — penguin2048, Mar 30 '18 at 18:51

score 1 · Accepted Answer · answered Dec 01 '16 at 20:11

In your case counts is a sparse matrix; you can add columns to it with extra features:

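import numpy as np
from scipy import sparse as sp

counts = vectorizer.fit_transform(data['message'].values)
# placeholder extra feature: one column of ones, one row per message
ones = np.ones(shape=(len(data), 1))
# append the extra column to the right of the word counts
X = sp.hstack([counts, ones])

classifier.fit(X, targets)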
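To make the column-stacking idea concrete for the "overall word count" feature raised in the comments, here is a minimal sketch. The toy `messages` and `targets` are made-up stand-ins for `data['message']` and `data['sentiment']`, and counting words with `str.split()` is an assumption rather than what `myTokenizer` does. Note that MultinomialNB expects non-negative feature values, which a word count satisfies.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for data['message'] / data['sentiment']
messages = [
    "good movie",
    "really bad and boring film",
    "okay I guess",
    "great great fun",
]
targets = np.array([4, 0, 2, 4])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # sparse, one column per vocabulary term

# Extra feature: whitespace-separated word count, shaped (n_samples, 1)
word_counts = sp.csr_matrix([[len(m.split())] for m in messages])

# Append the extra column to the right of the bag-of-words columns
X = sp.hstack([counts, word_counts]).tocsr()

classifier = MultinomialNB()
classifier.fit(X, targets)

# New messages need the same columns in the same order before predicting
new_messages = ["bad film"]
X_new = sp.hstack([
    vectorizer.transform(new_messages),
    sp.csr_matrix([[len(m.split())] for m in new_messages]),
]).tocsr()
prediction = classifier.predict(X_new)
```

The main drawback of doing this by hand is that the extra column has to be rebuilt in exactly the same way at prediction time.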

scikit-learn also provides a built-in helper for that; it is called FeatureUnion. There is an example of combining features from two transformers in scikit-learn docs:

from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)

# then you can do this:
X = combined.fit_transform(my_data)

FeatureUnion does almost the same thing as the manual hstack: it takes a list of named transformers, applies each of them to the same input data, then concatenates the results column-wise.

It is usually better to use FeatureUnion because you will have an easier time using scikit-learn cross-validation, pickling the final pipeline, etc.
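Applied to the text case from the question, a FeatureUnion inside a Pipeline could look roughly like the sketch below. The helper `word_count_column`, the step names, and the toy data are hypothetical, and the default `CountVectorizer()` is used in place of the question's custom tokenizer and stop-word list.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def word_count_column(texts):
    """Hypothetical helper: (n_samples, 1) array of words per text."""
    return np.array([[len(t.split())] for t in texts])


# Both transformers receive the same raw texts; outputs are joined column-wise
features = FeatureUnion([
    ("bag_of_words", CountVectorizer()),
    ("word_count", FunctionTransformer(word_count_column, validate=False)),
])

model = Pipeline([
    ("features", features),
    ("classifier", MultinomialNB()),
])

# Hypothetical toy data standing in for data['message'] / data['sentiment']
messages = [
    "good movie",
    "really bad and boring film",
    "okay I guess",
    "great great fun",
]
targets = [4, 0, 2, 4]

model.fit(messages, targets)
predictions = model.predict(["bad film", "great fun"])
```

Because the extra feature lives inside the pipeline, `model` can be passed as a single estimator to helpers such as `cross_val_score`, and the word count is recomputed automatically for new text.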

See also these tutorials:

score 0 · Answer 2 · answered Nov 28 '16 at 21:20

It depends on your data and what you are trying to do. There are different transformation methods you can use besides plain word counts: Bag of Words, TF-IDF, word vectors, ...
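As one illustration of a different transformation, the sketch below swaps the count features for TF-IDF weights using scikit-learn's `TfidfVectorizer` with default settings. The toy data is made up. MultinomialNB is designed for counts, but it also accepts fractional values such as TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for data['message'] / data['sentiment']
messages = [
    "good movie",
    "really bad and boring film",
    "okay I guess",
    "great great fun",
]
targets = [4, 0, 2, 4]

# TF-IDF down-weights terms that appear in many documents;
# by default each row is scaled to unit L2 norm
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(messages)

classifier = MultinomialNB()
classifier.fit(X, targets)

prediction = classifier.predict(tfidf.transform(["bad film"]))
```

Whether this helps compared with raw counts depends on the data, so it is worth comparing both with cross-validation.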

You can read more in these resources:

- http://billchambers.me/tutorials/2015/01/14/python-nlp-cheatsheet-nltk-scikit-learn.html
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Tuan Vu

Hi, thanks for your answer. Those links seem helpful. However, I think my question is actually even simpler than you might think. I realise there are many more vectorizers available. But let's just say I wanted to use the overall word count of the message itself as an additional feature. That would be a simple integer. Currently, the `classifier.fit` function uses the matrix returned by the `CountVectorizer`. How do I add the word count to the vector of features used by the `classifier`, to make it use both `counts` and `overall word count`? – Micha Nov 30 '16 at 10:48
