I am trying to use scikit-learn to predict a value for an input text string. I am using TfidfVectorizer for vectorization and PassiveAggressiveClassifier for learning with partial_fit (see the code below):
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier, Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pickle
a,r = [],[]
vectorizer = TfidfVectorizer()
with open('val', 'rb') as f:
    r = pickle.load(f)
with open('text', 'rb') as f:
    a = pickle.load(f)
L = (vectorizer.fit_transform(a))
training_set = L[:3250]
testing_set = L[3250:]
M = np.array(r)
training_result = M[:3250]
testing_result = M[3250:]
cls = np.unique(r)
model = PassiveAggressiveClassifier()
model.partial_fit(training_set, training_result, classes=cls)
print(model)
predicted = model.predict(testing_set)
print(testing_result)
print(predicted)
Error log:
File "try.py", line 89, in <module>
model.partial_fit(training_set, training_result, classes=cls)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
coef_init=None, intercept_init=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 374, in _partial_fit
coef_init, intercept_init)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 167, in _allocate_parameter_mem
dtype=np.float64, order="C")
MemoryError
I was previously using CountVectorizer and logistic regression for classification, and that worked without issues. But my training data runs to millions of lines, so I want to implement incremental learning with the script above, which raises a MemoryError on every run.
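The first traceback ends in `_allocate_parameter_mem` with `dtype=np.float64`, which is where the model's weights are allocated. A rough back-of-the-envelope sketch, assuming a multiclass problem where the linear model keeps one dense float64 weight row per class (so memory grows with `n_classes * n_features`); the helper name and the class counts below are hypothetical, chosen only to show the scaling:

```python
def dense_weight_bytes(n_classes, n_features, bytes_per_value=8):
    """Bytes needed for a dense (n_classes, n_features) float64 matrix."""
    return n_classes * n_features * bytes_per_value

# 9190 is a feature count quoted in this post's ValueError;
# the class counts are made up for illustration.
n_features = 9190
for n_classes in (10, 1000, 100000):
    size_mib = dense_weight_bytes(n_classes, n_features) / float(2 ** 20)
    print("%6d classes -> %.1f MiB" % (n_classes, size_mib))
```

If the targets in `r` are close to unique per sample (for example, continuous values), `np.unique(r)` could yield a very large class list, so checking `len(cls)` before calling `partial_fit` may be a useful first diagnostic.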
UPDATE:
After applying partial learning in a loop, partial_fit raises a feature-count mismatch error (ValueError: Number of features 8897 does not match previous data 9190.)
Also, even if I set the max_features parameter, the predictions generated are incorrect.
Is there any way for partial_fit to accept a variable number of features?
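One possible direction, sketched here under assumptions rather than as a confirmed fix: scikit-learn's `HashingVectorizer` is stateless and always produces `n_features` columns, so every batch has the same width regardless of which words it contains. The toy texts, labels, and batch size below are hypothetical stand-ins for the pickled data:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical stand-ins for the pickled 'text' and 'val' data.
texts = ["red apple", "green apple", "fast car", "slow car",
         "ripe apple pie", "old car engine", "apple juice", "car wheel"]
labels = ["fruit", "fruit", "vehicle", "vehicle",
          "fruit", "vehicle", "fruit", "vehicle"]

# Stateless: no fit step, output width is fixed by n_features.
vectorizer = HashingVectorizer(n_features=2 ** 18)
model = PassiveAggressiveClassifier()
classes = sorted(set(labels))  # all classes must be known up front

batch_size = 4
shapes = []
for start in range(0, len(texts), batch_size):
    X = vectorizer.transform(texts[start:start + batch_size])
    y = labels[start:start + batch_size]
    model.partial_fit(X, y, classes=classes)
    shapes.append(X.shape)

print(shapes)
print(model.predict(vectorizer.transform(["apple tart"])))
```

The trade-off is that hashed features cannot be mapped back to words, and unrelated words can collide in the same column when `n_features` is small.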
Execution Output:
(400, 8481)
(400, 9277)
Traceback (most recent call last):
File "f9.py", line 65, in <module>
training_set, training_result, classes=cls)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
coef_init=None, intercept_init=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 379, in _partial_fit
% (n_features, self.coef_.shape[-1]))
ValueError: Number of features 9277 does not match previous data 8481.
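The two shapes printed above, (400, 8481) and (400, 9277), are what one would expect if the vectorizer were re-fit on each batch, since each batch then gets its own vocabulary. The loop itself is not shown, so that is an assumption; the sketch below only reproduces the effect on hypothetical toy batches and contrasts it with a vectorizer fitted once and reused:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy batches with different vocabularies.
batch_1 = ["red apple", "green apple"]            # 3 distinct tokens
batch_2 = ["fast car", "slow car", "old engine"]  # 5 distinct tokens

# Re-fitting per batch: column count follows each batch's vocabulary.
refit_widths = [TfidfVectorizer().fit_transform(b).shape[1]
                for b in (batch_1, batch_2)]

# Fitting once, then only transforming: column count stays constant.
shared = TfidfVectorizer().fit(batch_1 + batch_2)
shared_widths = [shared.transform(b).shape[1] for b in (batch_1, batch_2)]

print(refit_widths)
print(shared_widths)
```

Fitting once keeps the widths aligned, but it needs a full pass over the corpus to build the vocabulary before incremental training starts.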
Any help will be appreciated.
Thanks!