17

I'm doing text classification in Python with the Naive Bayes MultinomialNB classifier on web pages (I retrieve the text of each page from the web and then classify that text: web classification).

Now I'm trying to apply PCA to this data, but Python raises some errors.

My code for classification with Naive Bayes:

from sklearn.decomposition import PCA
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)

x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)

This Naive Bayes classification gives the following output:

>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
    with 6302 stored elements in Compressed Sparse Row format>

>>> print(x_train)
(0, 2966)   1
(0, 1974)   1
(0, 3296)   1
..
..
(42, 1629)  1
(42, 2833)  1
(42, 876)   1

Then I try to apply PCA to my data (temizdata):

>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)

but this raises the following error:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I converted the matrix to a dense matrix (a numpy array) and then tried to classify that dense matrix, but I got another error.

My main aim is to test the effect of PCA on text classification.

Converting to a dense array:

v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)

Finally, I try to classify:

classifer.fit(pca_t,y_train)

Error for the final classification:

ValueError: Input X must be non-negative

On one side, my data (temizdata) goes into Naive Bayes directly; on the other side, temizdata is first passed through PCA (to reduce the number of features) and then classified.

Rodrigo Laguna
zer03
  • I don't see why this should not work. How do you convert to a dense array and what error do you get then? – MB-F Jan 11 '16 at 16:03
  • Are you using an old version of scikit-learn? I don't think `from sklearn import PCA` is possible in recent versions... – MB-F Jan 11 '16 at 16:04
  • @kazemakase Im sorry I write wrong. I can convert to dense or numpy but NaiveBayes not working with new dense matrix. I added – zer03 Jan 11 '16 at 16:17

3 Answers

25

Rather than converting the sparse matrix to a dense one (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)

And, citing from the TruncatedSVD documentation:

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

which is exactly your use case.
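For completeness, here is a minimal end-to-end sketch of that idea, with a made-up four-document corpus standing in for temizdata. Note that TruncatedSVD output can contain negative values, so this sketch pairs it with LogisticRegression rather than MultinomialNB:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for temizdata.
docs = ["python code example", "python sklearn tutorial",
        "football match result", "football league score"]
labels = [0, 0, 1, 1]

# TruncatedSVD consumes the sparse count matrix directly; its output may be
# negative, hence a classifier without a non-negativity requirement.
pipe = make_pipeline(CountVectorizer(),
                     TruncatedSVD(n_components=2, random_state=42),
                     LogisticRegression())
pipe.fit(docs, labels)
prediction = pipe.predict(["sklearn python"])
```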

Imanol Luengo
  • 2
    This appears more useful than my suggestion. – MB-F Jan 11 '16 at 16:30
  • Thanks comment . But after TruncatedSVD , naivebayes classification gave same error : raise ValueError("Input X must be non-negative") ValueError: Input X must be non-negative – zer03 Jan 11 '16 at 16:30
  • 1
    @zer03 as the error tells you, you cannot pass a negative features to the MultinomialNB, and dimensionality reduction algorithms tend to do so (put the data in the [-1, 1] range). So, either you choose another training algorithm (different from NB), or you don't apply PCA, but you cannot use both together. From the [documentation of MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): `The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.`. – Imanol Luengo Jan 11 '16 at 16:38
  • @kazemakase you were also right in the part that only positive numbers are permitted with NB, so if the OP still wants to use `MultinomialNB`, my answer is no longer valid. But if he still wants to do dimensionality reduction, `TruncatedSVD` is the way to go. – Imanol Luengo Jan 11 '16 at 16:41
  • @imaluenge. thanks very much bro. Actually, I has researching tf-idf . If I cant do with another classification (forexample, SVM etc.) I 'll start to study tf-idf. Actually My main goal is distinguish classification / PCA (or feature reduced ) applied classification – zer03 Jan 11 '16 at 18:16
5

The MultinomialNB classifier needs non-negative, count-like features, but PCA breaks this property of the features. You will have to use a different classifier if you want to use PCA.

There may be other dimensionality reduction methods that work with NB, but I don't know of any. Maybe simple feature selection could work.

Side note: you could try to discretize the features after applying PCA, but I don't think this is a good idea.
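As an illustration of the feature-selection route (corpus and labels below are made up): SelectKBest with the chi2 score keeps a subset of the original count columns, so the reduced matrix stays non-negative and MultinomialNB still accepts it:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with your own documents and labels.
docs = ["spam offer money", "cheap money offer",
        "meeting agenda notes", "project meeting plan"]
labels = [1, 1, 0, 0]

# chi2-based selection drops columns but never changes the surviving counts,
# so the output is still non-negative and valid input for MultinomialNB.
pipe = make_pipeline(CountVectorizer(),
                     SelectKBest(chi2, k=4),
                     MultinomialNB(alpha=0.01))
pipe.fit(docs, labels)
```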

MB-F
  • Thanks very much all answers @kazemakase. You helped me too. for side note , this method maybe effect bad to result. But, even so I'll try – zer03 Jan 11 '16 at 16:37
0

The problem is that applying dimensionality reduction will generate negative features, and Multinomial NB does not accept negative features. Please refer to this question.

Try another classifier such as RandomForest, or use sklearn.preprocessing.MinMaxScaler() to scale your training features into [0, 1].

Justin Lange