
I am trying to use scikit-learn to predict a value for an input text string. I am using HashingVectorizer for data vectorization and PassiveAggressiveClassifier for learning via partial_fit (refer to the following code):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier, Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pickle

a,r = [],[]

vectorizer = TfidfVectorizer()

with open('val', 'rb') as f:
    r = pickle.load(f)

with open('text', 'rb') as f:
    a = pickle.load(f)

L = vectorizer.fit_transform(a)

training_set = L[:3250]
testing_set = L[3250:]

M = np.array(r)

training_result = M[:3250]
testing_result = M[3250:]

cls = np.unique(r)

model = PassiveAggressiveClassifier()

model.partial_fit(training_set, training_result, classes=cls)
print(model)
predicted = model.predict(testing_set)

print(testing_result)
print(predicted)

Error log:

File "try.py", line 89, in <module>
    model.partial_fit(training_set, training_result, classes=cls)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 374, in _partial_fit
    coef_init, intercept_init)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 167, in _allocate_parameter_mem
    dtype=np.float64, order="C")
MemoryError

I was previously using CountVectorizer and logistic regression for classification and that worked without issues. But my training data is on the order of millions of lines, and I want to implement incremental learning using the above script, which is causing a MemoryError on each execution.

UPDATE:

After applying partial learning in a loop, the partial_fit function raises a feature-count mismatch error (ValueError: Number of features 8897 does not match previous data 9190). Also, even if I set the max_features attribute, the generated predictions are incorrect. Is there any way for the partial_fit method to accept a variable number of features?

Execution Output:

(400, 8481)
(400, 9277)
Traceback (most recent call last):
  File "f9.py", line 65, in <module>
    training_set, training_result, classes=cls)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 379, in _partial_fit
    % (n_features, self.coef_.shape[-1]))
ValueError: Number of features 9277 does not match previous data 8481.

Any help will be appreciated.

Thanks!

Jatin Bansal
  • After update: Can you give us a bit more code? When does it crash: after several partial_fit calls, or at the second one? Can you print the `shape` of your different variables (set and result)? – RPresle Jun 11 '15 at 07:42
  • Updated the question with crash log. – Jatin Bansal Jun 11 '15 at 07:59
  • To my mind the problem may come from the HashingVectorizer, but I need to see all the code to be sure and to find the possible reason. We also need more detail on where in the execution the error occurs, and the shape of every array. As this clearly goes beyond the memory error and partial_fit, please consider asking another question and posting the link here. – RPresle Jun 11 '15 at 08:12
  • I am not using HashingVectorizer but tried the Count & Tfidf vectorizers. – Jatin Bansal Jun 11 '15 at 08:53
  • @RPresle, Please refer to the new question, here: http://stackoverflow.com/questions/30776240/incremental-learning-in-scikit-with-passiveaggressiveclassifiers-partial-fit – Jatin Bansal Jun 11 '15 at 09:04

1 Answer


The MemoryError comes from having too much data in memory. When you load the data you hold a quantity N, and when you call partial_fit the algorithm will store some additional data on top of that, possibly close to another N, depending on the algorithm.

You don't need to hold your data twice. Try to reduce the size of your initial chunk of data: split it into several parts and feed them to the partial_fit method one at a time.

You should read your file line by line to build a chunk of data, fit that chunk, free the memory, and then repeat:

import io

CHUNK_SIZE = 1000  # lines per chunk; pick a size that fits comfortably in memory

with io.open(path, "r", encoding="utf-8") as f:
    arr = []
    for line in f:
        # Build up the current chunk of lines
        arr.append(line)

        # Learn with partial_fit once the chunk is full
        if len(arr) == CHUNK_SIZE:
            model.partial_fit(vectorizer.transform(arr), chunk_labels, classes=cls)  # chunk_labels: the labels for this chunk
            # Flush the chunk and start again
            arr = []
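
If you call fit_transform on the vectorizer separately for every chunk, each chunk gets its own vocabulary, which is what produces the "Number of features ... does not match previous data" error from your update. One way to keep the feature space identical across chunks is a stateless HashingVectorizer with a fixed n_features; a rough sketch (not tested on your data, and labels_for(chunk) is a placeholder for however you load the matching labels):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
import numpy as np

# Stateless: no fit() needed, every transform maps into the same n_features columns
vectorizer = HashingVectorizer(n_features=2 ** 18)
model = PassiveAggressiveClassifier()
cls = np.unique(r)  # partial_fit needs every class up front

for chunk in text_chunks:                 # text_chunks: your chunks of raw lines
    X = vectorizer.transform(chunk)       # shape (len(chunk), 2 ** 18) every time
    y = labels_for(chunk)                 # placeholder: labels matching this chunk
    model.partial_fit(X, y, classes=cls)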
RPresle
  • Thanks for your answer. Let me give it a try. – Jatin Bansal Jun 10 '15 at 04:45
  • I want to save the model using joblib, so do I need to call joblib.dump at the end (outside the loop) and does partial_fit automatically update the previously built model? – Jatin Bansal Jun 10 '15 at 07:51
  • I don't think the memory is linked to the file. joblib just takes a snapshot of your in-memory object and writes it to the file, so it will not be updated afterwards. You can check this by making a first dump D1 after the first chunk and comparing its size with another dump made at the end. Dump when you actually need the model; as you are reading your file in one pass, dump it afterwards to save time. – RPresle Jun 10 '15 at 07:57
  • Also, after dumping the model at the end, at prediction time I need to fit_transform the complete file again (which again causes the memory error); otherwise it gives the unmatched-features error. – Jatin Bansal Jun 10 '15 at 08:17
  • Dumping the model won't free the memory, so if you don't need to save your learning phase it isn't required. Once you have run partial_fit on the whole file, you shouldn't do another fit, because it would erase what you've learned before. Please complete your question with an edited part adding your new code and the error, so we can find a way to correct it. – RPresle Jun 10 '15 at 08:23
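
Regarding the joblib discussion in the comments: a minimal sketch of dumping once after the partial_fit loop and reloading the model later (file names are illustrative, and it assumes a stateless vectorizer such as HashingVectorizer so the full file does not have to be re-fit at prediction time):

from sklearn.externals import joblib  # old-style import, matching the question's sklearn version

# After the partial_fit loop has consumed the whole file, dump once:
joblib.dump(model, 'model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Later, in a separate prediction script:
model = joblib.load('model.pkl')
vectorizer = joblib.load('vectorizer.pkl')
print(model.predict(vectorizer.transform(["some new text to classify"])))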