First of all, I am new to Python and NLP / machine learning. Right now I have the following code:

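from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# myTokenizer and data (a DataFrame with 'message' and 'sentiment' columns) are defined elsewhere
vectorizer = CountVectorizer(
    input="content",
    decode_error="ignore",
    strip_accents=None,
    stop_words=stopwords.words('english'),
    tokenizer=myTokenizer
)
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['sentiment'].values
classifier.fit(counts, targets)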

Now this actually works pretty well. I get a sparse matrix from the CountVectorizer, and the classifier uses that matrix together with the targets (0, 2, 4).

However, what would I have to do if I wanted to use more features in the vector instead of just the word counts? I can't seem to find that out. Thank you in advance.

Micha
  • Possible duplicate of [How to add another feature (length of text) to current bag of words classification? Scikit-learn](https://stackoverflow.com/questions/39121104/how-to-add-another-feature-length-of-text-to-current-bag-of-words-classificati) – penguin2048 Mar 30 '18 at 18:51

2 Answers

In your case counts is a sparse matrix; you can add columns to it with extra features:

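import numpy as np
from scipy import sparse as sp

counts = vectorizer.fit_transform(data['message'].values)
# placeholder extra feature: a single column of ones, one row per message
ones = np.ones(shape=(len(data), 1))
# stack column-wise; the result is still a sparse matrix
X = sp.hstack([counts, ones])

classifier.fit(X, targets)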
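
To make this concrete for the "overall word count" case, here is a minimal sketch of the same stacking approach. The toy `messages` and `targets` are made up for illustration, and counting words with `str.split()` is my assumption about what "word count" means here, not something from the question:

```python
import numpy as np
from scipy import sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for data['message'] and data['sentiment'] (made up for illustration)
messages = ["good movie", "really bad movie", "good good fun", "bad bad plot, really bad"]
targets = np.array([4, 0, 4, 0])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # sparse, shape (n_messages, n_vocabulary)

# Extra feature: number of whitespace-separated tokens per message, as one column
word_counts = np.array([[len(m.split())] for m in messages])

# Stack column-wise and convert to CSR so the result supports row indexing
X = sp.hstack([counts, sp.csr_matrix(word_counts)]).tocsr()

classifier = MultinomialNB()
classifier.fit(X, targets)

# At prediction time the same extra column has to be appended in the same position
new_messages = ["good fun"]
new_X = sp.hstack([
    vectorizer.transform(new_messages),
    sp.csr_matrix([[len(m.split())] for m in new_messages]),
]).tocsr()
prediction = classifier.predict(new_X)
```

Keep in mind that MultinomialNB expects non-negative features, which a word count satisfies; a feature that can go negative would need a different classifier or a rescaling step.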

scikit-learn also provides a built-in helper for this, called FeatureUnion. The scikit-learn docs include an example of combining features from two transformers:

from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)

# then you can do this:
X = combined.fit_transform(my_data)

FeatureUnion does almost the same thing as the manual stacking: it takes a list of named transformers (vectorizers included), applies each of them to the same input data, then concatenates the results column-wise.

It is usually better to use FeatureUnion because you will have an easier time using scikit-learn cross-validation, pickling the final pipeline, etc.
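
As a rough sketch of how that could look for bag-of-words plus a word-count column: the helper `message_word_counts`, the step names, and the toy data below are all hypothetical, and wrapping the helper in `FunctionTransformer` is just one way to turn a plain function into a transformer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def message_word_counts(messages):
    """Hypothetical helper: one column with the whitespace-token count per message."""
    return np.array([[len(m.split())] for m in messages])


# Both transformers receive the same raw messages; outputs are joined column-wise
features = FeatureUnion([
    ("bow", CountVectorizer()),
    ("length", FunctionTransformer(message_word_counts, validate=False)),
])

model = Pipeline([
    ("features", features),
    ("nb", MultinomialNB()),
])

# Toy stand-ins for data['message'] and data['sentiment'] (made up for illustration)
messages = ["good movie", "really bad movie", "good good fun", "bad bad plot, really bad"]
targets = [4, 0, 4, 0]

model.fit(messages, targets)
predictions = model.predict(["good fun", "bad plot"])
```

Because the extra feature lives inside the pipeline, `model.predict` applies it automatically to new messages, which is what makes cross-validation and pickling simpler than with manual stacking.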

See also these tutorials:

Mikhail Korobov
It depends on your data and what you are trying to do. Besides raw word counts, there are other ways to turn text into features: bag of words, TF-IDF, word vectors, ...
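
For instance, swapping the counts for TF-IDF weights is mostly a matter of changing the vectorizer. This is only a sketch on made-up toy data, not a recommendation for any particular dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for data['message'] and data['sentiment'] (made up for illustration)
messages = ["good movie", "really bad movie", "good good fun", "bad bad plot, really bad"]
targets = [4, 0, 4, 0]

# TfidfVectorizer tokenizes like CountVectorizer but returns TF-IDF weights
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(messages)  # sparse, rows are L2-normalized by default

classifier = MultinomialNB()
classifier.fit(weights, targets)

prediction = classifier.predict(tfidf.transform(["good fun"]))
```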

You can read more in these resources:
- http://billchambers.me/tutorials/2015/01/14/python-nlp-cheatsheet-nltk-scikit-learn.html
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Tuan Vu
  • Hi, thanks for your answer. Those links seem helpful. However, I think my question is actually even simpler than you might think. I realise there are many more vectorizers available. But let's just say I wanted to use the overall word count of the message itself as an additional feature. That would be a simple integer. Currently, the `classifier.fit` function uses the matrix returned by the `CountVectorizer`. How do I add the word count to the vector of features used by the `classifier`, so that it uses both `counts` and `overall word count`? – Micha Nov 30 '16 at 10:48